CONCEPT DIABETES: Analytical pipeline
Introduction
CONCEPT-DIABETES
CONCEPT-DIABETES is part of CONCEPT, a coordinated research project initiative under the umbrella of REDISSEC, the Spanish Research Network on Health Services Research and Chronic Conditions] [www.redissec.com], that aims at analyzing chronic care effectiveness and efficiency in a number of cohorts built on real world data (RWD). In the specific case of CONCEPT-DIABETES, the focus will be on assessing the effectiveness of a set of clinical practice actions and quality improvement strategies at different levels (patient-level, health-care provider level and health system level) on the management and health results of patients with type 2 diabetes (T2D) using process mining methodology.
It is a population-based retrospective observational study centered on all T2D patients diagnosed in four Regional Health Services within the Spanish National Health Service, that includes information from all their contacts with the health services using the electronic medical record systems including Primary Care data, Specialist Care data, Hospitalizations, Urgent Care data, Pharmacy Claims, and also other registers such as the mortality and the population register. We will assess to what extent recommended interventions from evidence-based guidelines are implemented in real life and which are their effects on health outcomes. Process mining methods will be used to analyze the data, and comparison with standard methods will be also conducted.
Cohort
The cohort is defined as patients with type 2 diabetes:
Inclusion criteria: Patients that, at 2017-01-01 or during the follow-up from 2017-01-01 to 2022-12-31, had active health card (active TIA - tarjeta sanitaria activa) and one of the inclusion codes given in the ‘inclusion code list (’T90’ if CIAP-2, ‘250’ if ‘CIE-9CM’ or ‘E11’ if CIE-10-ES).
Exclusion criteria: Patients that had none of the exclusion codes given in the exclusion codes list (‘T89’ if CIAP-2, ‘250.01’ if CIE-9CM, ‘250.03’ if CIE-9CM, ‘250.11’ if CIE-9CM, ‘250.13’ if CIE-9CM, ‘250.21’ if CIE-9CM, ‘250.23’ if CIE-9CM, ‘250.31’ if CIE-9CM, ‘250.33’ if CIE-9CM, ‘250.41’ if CIE-9CM, ‘250.43’ if CIE-9CM, ‘250.51’ if CIE-9CM, ‘250.53’ if CIE-9CM, ‘250.61’ if CIE-9CM, ‘250.63’ if CIE-9CM, ‘250.71’ if CIE-9CM, ‘250.73’ if CIE-9CM, ‘250.81’ if CIE-9CM, ‘250.83’ if CIE-9CM, ‘250.91’ if CIE-9CM, ‘250.93’ if CIE-9CM or ‘E10’ if CIE-10-ES) during their follow-up or patients with no contact with the health system from 2017-01-01 to 2022-12-31.
Study period: 2017-01-01 until 2022-12-31.
Treatment guidelines
One of the main intermediate outcome indicators to which clinical guidelines pay special attention is a good glycaemic control, since its absence is clearly related to micro and macrovascular complications. In clinical practice, suboptimal glycaemic control can be mainly attributed to two main reasons: the patients’ non-adherence to prescribed treatment; and the healthcare providers’ clinical or therapeutic guidelines’ non-adherence.
Treatment decisions are based on glycated hemoglobin measurements. In this context the redGDPS foundation provides DM2 treatment algorithm, a diagram that aims to help professionals to quickly and easily choose the most appropriate treatment for people with diabetes.
Process indicators
Another measures to which different studies pay special attention are process indicators. First, process indicators are essential for assessing the three dimensions of healthcare quality: effectiveness, safety, and patient-centeredness. They are measured relatively easily and often do not require risk-adjustment, making their interpretation straightforward. Poor performance on process indicators can be directly attributed to the actions of healthcare providers, providing a clear indication for improvement, such as better adherence to clinical guidelines. Moreover, process indicators allow for the identification of areas for quality improvement, which is crucial in the complex healthcare environment. These indicators cover a wide range of health factors and help focus resources and efforts to improve the health and well-being of the population. Therefore, the attention given to health process indicators is justified by their crucial role in evaluating and improving healthcare quality and population health.
Running code
The python libraries used are:
Code
import sys
import pm4py, subprocess
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import textdistance
import gensim.corpora as corpora
from tqdm import trange, tqdm
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.cluster import KElbowVisualizer
from pm4py.objects.petri_net.obj import PetriNet, Marking
from pm4py.visualization.decisiontree import visualizer as tree_visualizer
from pm4py.algo.decision_mining import algorithm as decision_mining
from pm4py.visualization.petri_net import visualizer
from gensim.models import LdaModel
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import duckdb
import re
import loggingThe R libraries used are:
Code
library(tidyverse)
library(lubridate)
library(jsonlite)
library(ggplot2)
library(bupaR)
library(processmapR)
library(dplyr)
library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
library(here)
library(survival)
library(lcmm)
library(coxme)
library(muhaz)
library(ggfortify)
library(bayestestR)
library(purrr)
library(duckdb)
library(logger)
library(finalfit)
library(flextable)
library(knitr)Data preprocessing
First of all, it is important to prepare correctly the data for the analysis. The next function creates some sql views that are going to be useful later:
Code
def general_preprocess(con):
'''
Generating views
Args:
con : db connector variable
'''
con.sql("CREATE OR REPLACE VIEW cmbd_incidents_view AS SELECT * FROM main.cmbd \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE OR REPLACE VIEW ss_use_incidents_view AS SELECT * FROM main.ss_use \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE OR REPLACE VIEW comorb_incidents_view AS SELECT * FROM main.comorb \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE OR REPLACE VIEW param_incidents_view_ AS SELECT DISTINCT * FROM main.param \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01') \
AND (param_value!=0 OR param_name!='bmi')")
con.sql("CREATE OR REPLACE VIEW param_incidents_view AS \
SELECT DISTINCT patient_id,param_name,MAX(param_value) AS param_value,param_date\
FROM param_incidents_view_ GROUP BY patient_id,param_name,param_date")
con.sql("CREATE OR REPLACE VIEW param_cat_incidents_view AS SELECT DISTINCT * FROM main.param_cat \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE OR REPLACE VIEW patient_incidents_view AS SELECT *,\
FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25) \
AS 'age', \
FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25) \
AS 'age_dx', \
DATE_ADD(month_nac, INTERVAL 75 YEAR) AS 'turn_to_75', \
DATE_ADD(month_nac, INTERVAL 65 YEAR) AS 'turn_to_65' \
FROM main.patient WHERE patient_id IN (SELECT patient_id \
FROM main.patient WHERE dx_date >='2017-01-01')")
con.sql("CREATE OR REPLACE VIEW treat_incidents_view AS SELECT * FROM main.treat \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE OR REPLACE VIEW exams_incidents_view AS SELECT * FROM main.exams \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE OR REPLACE VIEW param_incidents_view_age_date AS SELECT param_incidents_view.patient_id, \
param_name, param_value, param_date, turn_to_65, turn_to_75 FROM \
param_incidents_view LEFT JOIN patient_incidents_view ON \
param_incidents_view.patient_id = patient_incidents_view.patient_id")
con.sql("CREATE OR REPLACE VIEW cmbd_incidents_postdx_view AS \
SELECT * FROM cmbd_incidents_view \
WHERE admission_date > (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = cmbd_incidents_view.patient_id)")
con.sql("CREATE OR REPLACE VIEW cmbd_incidents_postdx_first_view AS \
SELECT patient_id, admission_date, diagnosis1_code FROM (SELECT *,\
ROW_NUMBER() OVER(PARTITION BY patient_id \
ORDER BY admission_date) AS rn \
FROM cmbd_incidents_postdx_view) t WHERE rn = 1;")
con.sql("CREATE OR REPLACE VIEW param_incidents_predx_view AS SELECT * FROM (\
SELECT * FROM (SELECT patient_id, param_name, param_value, param_date \
FROM param_incidents_view) sub \
WHERE param_date <= (SELECT dx_date FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id));")
con.sql("CREATE OR REPLACE VIEW param_incidents_predx_last_view AS \
SELECT patient_id, param_name, param_date, param_value \
FROM (SELECT patient_id, param_name, param_date, param_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
ORDER BY param_date DESC) AS rn \
FROM param_incidents_predx_view \
WHERE param_value!=9999) t WHERE rn = 1;")
con.sql("CREATE OR REPLACE VIEW param_cat_incidents_predx_view AS SELECT * FROM (\
SELECT * FROM (SELECT patient_id, param_cat_name, param_cat_value, param_cat_date \
FROM param_cat_incidents_view) sub \
WHERE param_cat_date <= (SELECT dx_date FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id));")
con.sql("CREATE OR REPLACE VIEW param_cat_incidents_predx_last_view AS \
SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
ORDER BY param_cat_date DESC) AS rn \
FROM param_cat_incidents_predx_view) t WHERE rn = 1;")
con.sql("CREATE OR REPLACE VIEW param_prevalents_pre_view AS \
SELECT DISTINCT patient_id, param_name, param_value, param_date FROM main.param \
WHERE param_date < '2017-01-01' AND patient_id IN (\
SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
con.sql("CREATE OR REPLACE VIEW param_prevalents_pre_last_view AS \
SELECT patient_id, param_name, param_date, param_value \
FROM (SELECT patient_id, param_name, param_date, param_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
ORDER BY param_date DESC) AS rn \
FROM param_prevalents_pre_view \
WHERE param_value!=9999) t WHERE rn = 1;")
con.sql("CREATE OR REPLACE VIEW param_cat_prevalents_pre_view AS \
SELECT DISTINCT * FROM main.param_cat \
WHERE param_cat_date < '2017-01-01' AND patient_id IN (\
SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
con.sql("CREATE OR REPLACE VIEW param_cat_prevalents_pre_last_view AS \
SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
ORDER BY param_cat_date DESC) AS rn \
FROM param_cat_prevalents_pre_view) t WHERE rn = 1;") Treatments’ event log creation
For using process mining an event log is needed. The sort of functions below take data model’s treatment and parameter tables to create an event log of glycated hemoglobin measures and treatment of patients. Glycated hemoglobin measurement events are divided into two different states, those that have a value smaller than “L” and the others. L’s value depends on patient’s diabetes duration, age and comorbilities and complications. Therefore with the help of expert clinicians we decided that L would take the following values:
- L=7 if age<75 and no cardiovascular disease, heart failure, chronic kidney disease or fragility.
- L=8 if age<65 and any cardiovascular disease, heart failure, chronic kidney disease or fragility.
- L=8.5 if else.
As they are measures these events do not have any duration. In the case of treatments on the other hand a duration period exists. For treatments, events definition is based on drugs prescriptions if exists or dispensing dates and a fixed period time if else. Functions below make event logs with the previous considerations. Treatment analysis of this document is thought to do by the predominant clinical condition of patients according to DM2 treatment algorithm. In other words, patients are grouped by their predominant clinical condition and the analysis is realized independently to each group.
Code
logging.basicConfig(level=logging.INFO)
def separate_patients_by_condition(con):
'''
Filtering data by predominant clinical condition according to redGDPS
Args:
con : db connector variable
Return:
presc_data_p (float): percentage of not null data in prescription dates
https://www.sciencedirect.com/science/article/pii/S1751991821000176?via%3Dihub#tbl0015
https://www.sanidad.gob.es/estadEstudios/estadisticas/estadisticas/estMinisterio/SIAP/map_cie9mc_cie10_ciap2.htm
https://cienciadedatosysalud.org/wp-content/uploads/Metodologia_Atlas_Prescripcion_enero_2023-1.pdf
'''
cvd_code_list = ['K74','K75','K76','K89','K90','K91']
hf = 'K77'
ckd = 'U99.01'
ob = "T82"
to_list = lambda l: "('"+"','".join(l)+"')"
presc_data_p = con.sql(f" SELECT (COUNT(prescription_date_ini)) / COUNT(*) \
AS p FROM treat_incidents_view").fetchall()[0][0]
cie9_df = pd.read_csv('./CIE9.csv').rename(columns={'CIE9':'CIE',
'BDCAP':'CIAP'})
cie9_df['comorb_codif'] = 'ICD9'
cie10_df = pd.read_csv('./CIE10.csv').rename(columns={'CIE10':'CIE',
'BDCAP':'CIAP'})
cie10_df['comorb_codif'] = 'ICD10'
cie2ciap = pd.concat([cie9_df[['comorb_codif','CIE','CIAP']],
cie10_df[['comorb_codif','CIE','CIAP']]])
con.sql("CREATE OR REPLACE VIEW comorb_incidents_ciap_view AS \
SELECT * FROM comorb_incidents_view \
LEFT JOIN cie2ciap \
ON comorb_incidents_view.comorb_codif = cie2ciap.comorb_codif \
AND comorb_incidents_view.comorb_code = cie2ciap.CIAP")
con.sql("CREATE OR REPLACE VIEW comorb_incidents_active_view AS SELECT *, \
CASE WHEN comorb_codif = 'CIAP2' THEN comorb_code ELSE CIAP END AS CIAP2 \
FROM comorb_incidents_ciap_view WHERE comorb_date_end IS NULL;")
# Which is the predominal clinical condition at baseline?
f_list = set()
ob_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view WHERE CIAP2 = '{ob}'"
).df()['patient_id'])
ckd_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view WHERE CIAP2 = '{ckd}'"
).df()['patient_id'])
hf_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view WHERE CIAP2 = '{hf}'"
).df()['patient_id'])
cvd_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view \
WHERE CIAP2 IN {to_list(cvd_code_list)}"
).df()['patient_id'])
# Which is the predominal clinical condition at dx_date?
# ENFERMEDADES ISQUÉMICAS CARDIACAS (I20-I25)
# ENFERMEDADES CEREBROVASCULARES (I60-I69)
# ISQUEMIA CEREBRAL TRANSITORIA (G45)
# ICTUS (G46)
to_cvd_change_prelist = ['I2%i' %i for i in range(6)]+[
'I6%i' %i for i in range(10)]+[
'G45','G46']
to_cvd_change_list = []
for code in to_cvd_change_prelist:
for c in cie10_df['CIE']:
if code in c:
to_cvd_change_list.append(c)
# INSUFICIENCIA CARDÍACA (I50)
# ENFERMEDAD CARDÍACA Y RENAL CRÓNICA HIPERTENSIVA (I13)
to_hf_change_list = ['I09.81', 'I11.0', 'I13.0', 'I13.2', 'I50', 'I50.0',
'I50.1', 'I50.9']
# ENFERMEDAD RENAL CRÓNICA (N18)
# ENFERMEDAD RENAL CRÓNICA HIPERTENSIVA (I12)
to_ckd_change_list = ['N18','N18.1','N18.2','N18.3','N18.30','N18.31',
'N18.32','N18.4','N18.5','N18.6','N18.9','I12',
'I12.0','I12.9','filtglom']
to_f_change_list = ['f_date']
to_ob_change_list = ['bmi>=30','E66.01','E66.09','E66.1','E66.2',
'E66.8','E66.9']
# SOBREPESO Y OBESIDAD (E66)
from_cvd_change_list = ['death']
from_hf_change_list = from_cvd_change_list+to_cvd_change_list
from_ckd_change_list = from_hf_change_list+to_hf_change_list
from_f_change_list = from_ckd_change_list+to_ckd_change_list
from_ob_change_list = from_f_change_list+to_f_change_list+['bmi<30']
from_else_change_list = from_ob_change_list[:-1]+to_ob_change_list
relevance_order = pd.DataFrame()
relevance_order['code'] = from_ob_change_list+to_ob_change_list+['end']
relevance_order['order'] = range(len(relevance_order))
# https://ukkidney.org/health-professionals/information-resources/uk-eckd-guide/ckd-stages
# Include obesity and kcd detection by parameters values
param_filt = con.sql("SELECT * FROM param_incidents_view \
WHERE param_name = 'filtglom' AND param_value<60;"
).df().sort_values(['patient_id','param_date'])
param_filt['time_diff'] = param_filt.groupby(
'patient_id')['param_date'].diff()
cmbd_filt = param_filt[param_filt['time_diff'] < pd.Timedelta(days=90)
].drop_duplicates(subset='patient_id',keep='first')[
['patient_id','param_date','param_name']
].rename(columns={
'param_name':'code',
'param_date':'admission_date'})
con.sql("CREATE OR REPLACE VIEW cmbd_incidents_long_view AS SELECT *,\
COALESCE(t.order, 404) AS relevance FROM\
(SELECT patient_id, admission_date, diagnosis1_code AS code \
FROM cmbd_incidents_view WHERE diagnosis1_code IS NOT NULL \
UNION ALL SELECT patient_id, admission_date, diagnosis2_code AS code \
FROM cmbd_incidents_view WHERE diagnosis2_code IS NOT NULL \
UNION ALL SELECT patient_id, admission_date, diagnosis3_code AS code \
FROM cmbd_incidents_view WHERE diagnosis3_code IS NOT NULL \
UNION ALL SELECT patient_id, admission_date, code FROM cmbd_filt \
UNION ALL SELECT patient_id, turn_to_75 AS admission_date, 'f_date' \
AS code FROM patient_incidents_view WHERE turn_to_75<'2023-01-01'\
UNION ALL SELECT patient_id, death_date AS admission_date, 'death' \
AS code FROM patient_incidents_view WHERE death_date IS NOT NULL\
UNION ALL SELECT patient_id, DATE '2023-01-01' AS admission_date, 'end' \
AS code FROM patient_incidents_view\
UNION ALL SELECT patient_id,param_date AS admission_date, \
CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
FROM param_incidents_predx_last_view WHERE param_name = 'bmi' \
UNION ALL SELECT patient_id, param_date AS admission_date, \
CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
FROM param_incidents_view sub WHERE param_name = 'bmi' \
AND param_date > (SELECT dx_date FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id)) v \
LEFT JOIN relevance_order t ON v.code=t.code;")
f_list.update(set(con.sql("SELECT DISTINCT patient_id \
FROM patient_incidents_view WHERE turn_to_75<=dx_date"
).df()['patient_id']))
con.sql(f"CREATE OR REPLACE VIEW ob_events AS SELECT * \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_ob_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
)
ob_list.update(set(con.sql("SELECT DISTINCT patient_id \
FROM ob_events").df()['patient_id']))
#Include patients with bmi>=30 in the first 21 days and no bmi data before dx
ob_list.update(set(con.sql(
f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_ob_change_list)} \
AND DATEDIFF('day',(SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id),\
admission_date)<21 \
AND patient_id NOT IN (SELECT DISTINCT patient_id \
FROM ob_events);").df()['patient_id']))
ob_list = ob_list - set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code='bmi<30' \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id) \
AND admission_date > (SELECT MAX(admission_date)\
FROM ob_events \
WHERE ob_events.patient_id = sub.patient_id);"
).df()['patient_id'])
ckd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_ckd_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
).df()['patient_id']))
hf_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_hf_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
).df()['patient_id']))
cvd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_cvd_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
).df()['patient_id']))
cvd_list = list(cvd_list)
hf_list = list(hf_list.difference(cvd_list))
ckd_list = list(ckd_list.difference(cvd_list+hf_list))
f_list = list(f_list.difference(cvd_list+hf_list+ckd_list))
ob_list = list(ob_list.difference(cvd_list+hf_list+ckd_list+f_list))
else_list = list(set(con.sql("SELECT DISTINCT patient_id \
FROM patient_incidents_view"
).df()['patient_id']
).difference(cvd_list+hf_list+ckd_list+f_list+ob_list))
# When is a change on predominant clinical condition made?
con.sql(f"CREATE OR REPLACE VIEW cmbd_incidents_long_view_dx AS SELECT * \
FROM cmbd_incidents_long_view sub \
WHERE admission_date > (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);")
query_function = lambda cond: "(WITH rankedevents_*** AS (\
SELECT patient_id,code,admission_date,ROW_NUMBER() \
OVER(PARTITION BY patient_id ORDER BY admission_date,relevance) AS event_rank \
FROM cmbd_incidents_long_view_dx WHERE patient_id IN {to_list(***_list)} \
AND code IN {to_list(from_***_change_list)} ) \
SELECT patient_id, '***' AS type,code, admission_date AS date \
FROM rankedevents_*** WHERE event_rank=1 AND date<'2023-01-01') \
UNION ALL (WITH rankedevents__*** AS (\
SELECT patient_id,code,admission_date,ROW_NUMBER() \
OVER(PARTITION BY patient_id ORDER BY admission_date) AS event_rank \
FROM cmbd_incidents_long_view_dx WHERE patient_id IN {to_list(***_list)} \
AND code IN {to_list(['end']+from_***_change_list)} ) \
SELECT patient_id, '***' AS type,code, admission_date AS date \
FROM rankedevents__*** WHERE event_rank=1 AND code='end')\
".replace('***',cond)
con.sql(eval('f"CREATE OR REPLACE TABLE patient_condition AS {}"'.format(' UNION ALL '.join(
[query_function(cond) for cond in ['cvd','hf','ckd','f','ob','else']]))))
return presc_data_p
def evlog_creation_by_prescriptions(con,
cond,
code2drug_info_path='./diabetes_drugs.csv'):
'''Preprocessing and event log obtention with prescriptions
Args:
con : db connector variable
cond (str): predominal clinical condition's code
code2drug_info_path (str): drugs' and their codes' info's table's path
ADNI: ANTIDIABETICOS NO INSULINICOS
The treatment of type 2 diabetes mellitus with ADNI includes a wide range of
drugs which, depending on their drugs which, according to their mechanisms of
action, can be grouped as follows:
Increased endogenous insulin sensitivity:
o Biguanides: metformin (MET).
o Thiazolidinediones: pioglitazone (PIO).
Increased endogenous insulin secretion/release:
o Sulfonylureas (SU).
o Meglitinides: repaglinide (REP)
Reduction of digestive glucose absorption:
o Alpha-glucosidase inhibitors.
o Vegetable fiber and derivatives.
Incretin effect enhancers.
o Inhibitors of DPP-4 (iDPP-4).
o GLP-1 analogues (aGLP-1).
Inhibitors of renal glucose reuptake
o SGLT-2 inhibitors (iSGLP-2)
'''
fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
WHERE type = '{cond}'").df()
dx_date = con.sql(f"SELECT patient_id, dx_date FROM patient_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}')").df()
dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
fin_date = dict(zip(fin_date.patient_id,fin_date.date))
code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
).get('class','NONE2'
).replace('+','&')
treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND substring(atc_code,1,3) = 'A10' \
AND prescription_date_ini IS NOT NULL").df()
treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])
].drop_duplicates(subset=['patient_id',
'prescription_date_ini',
'prescription_date_end',
'Event'])
treat_df['actins'] = range(len(treat_df))
treat_df_start = treat_df[['patient_id','prescription_date_ini',
'Event','actins']].rename(columns =
{'prescription_date_ini':'date'})
treat_df_end = treat_df[['patient_id','prescription_date_end',
'Event','actins']].rename(columns =
{'prescription_date_end':'date'})
treat_df_start['cycle'] = 'start'
treat_df_end['cycle'] = 'end'
treat_df_end['date'] = treat_df_end['date'].fillna(
datetime.strptime('2022-12-31', "%Y-%m-%d")).apply(
lambda x: min([x,datetime.strptime('2023-01-01', "%Y-%m-%d")])-timedelta(days=1))
act_n = max(treat_df['actins'])
if cond in ['cvd','hf','ckd','f']:
param_df = con.sql(f"SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {8} THEN 'HBA<L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date < turn_to_65 THEN 'HBA>L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date >= turn_to_65 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event,\
ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
AND param_date IS NOT NULL").df()
else:
param_df = con.sql(f"SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {7} THEN 'HBA<L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date < turn_to_75 THEN 'HBA>L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date >= turn_to_75 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event,\
ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
AND param_date IS NOT NULL").df()
param_df_start = param_df.copy()
param_df_start['cycle'] = 'start'
param_df_end = param_df.copy()
param_df_end['cycle'] = 'end'
df = pd.concat([param_df_start,treat_df_start,param_df_end,treat_df_end]
).drop_duplicates()
df = df.sort_values(by = ['patient_id','date','Event','cycle'],
ascending = [True,True,True,False])
df.index = range(len(df))
patient_list = list(set(df['patient_id']))
df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))]
#####################################################################
event_log = dict()
event_log['patient_id'] = []
event_log['date'] = []
event_log['nid'] = []
event_log['Event'] = []
event_log['cycle'] = []
days_after_m = 5
id_list = list(set(df['patient_id']))
events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
row = len(events)
no_hba = [i for i in range(row) if i not in [hba0,hba1]]
for id in tqdm(id_list):
df_id = df[df['patient_id']==id]
nid = id_list.index(id)
actins = set(df_id['actins'])
date_min = min([min(set(df_id['date'])),dx_date[id]])
date_max = max(set(df_id['date']))
if str(fin_date[id])!='nan':
date_max = max([date_max,fin_date[id]])
else:
date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
dd = [date_min + timedelta(days=x) for x in range(
(date_max-date_min).days + 1)]
col = len(dd)
ev_status = np.zeros((row,col))
for act in actins:
ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]
ev = list(df_id['Event'][df_id['actins']==act])[0]
fin = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='end')])[0]
ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(ini),
dd.index(fin)+1)),1)
measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
ev_status[hba1,:].nonzero()[0])))
ev_status = ev_status.astype(int)
#delete treatments in measure events
ev_status = change_matrix_values(ev_status,no_hba,measures,0)
#correction to eliminate events after hemoglobin measurement that are
#the continuation of the previous treatment and not the new treatment.
#Example: If a patient has 'A' treatment and because of hemoglobin's
# measure they change their treatment to 'B', originally their
# trace could appear as A>measure>A>B, but the true trace
# should be A>measure>B. Therefore, in a fixed period time
# after each measurement 'days_after_m', if we detect that
# mistake we delete it.
for j in measures:
if ev_status.shape[1]<=j+1 or days_after_m<3:
continue
first_col = ev_status[:,j+1]
dif_col = np.where(np.any(ev_status[:, j+2:j+days_after_m]!=
first_col[:, np.newaxis],
axis=0))[0]
if (dif_col.size > 0) and (not
np.array_equal(ev_status[:,j-1],
ev_status[:,j+dif_col[0]+2])) and (
np.array_equal(ev_status[:,j-1],
ev_status[:,j+1])):
ev_status = change_matrix_values(ev_status,no_hba,
range(j+1,j+dif_col[0]+2),0)
col = dd.index(dx_date[id])
col_0 = dd.index(dx_date[id])
col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
min_change_days = 7
while col<col_max-1:
if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
ev = col2treat(ev_status,events,col_0)
if ('HBA' not in ev and col-col_0<min_change_days):
col+=1
col_0 = col
continue
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
col+=1
col_0 = col
else:
col+=1
ev = col2treat(ev_status,events,col_0)
if 'HBA' in ev or col-col_0>=min_change_days:
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
evlog = pd.DataFrame.from_dict(event_log)
evlog = evlog.sort_values(['patient_id','date'])
evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
evlog.index = range(len(evlog))
evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
evlog['actins'] = [n//2 for n in range(len(evlog))]
con.sql(f"CREATE OR REPLACE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
def evlog_creation_by_dispensations(con,
cond,
code2drug_info_path='./diabetes_drugs.csv',
nac_path='./Nomenclator_de_Facturacion.csv'):
'''Preprocessing and event log obtention with dispensations
Args:
con : db connector variable
cond (str): predominal clinical condition's code
code2drug_info_path (str): drugs' and their codes' info's table's path
nac_path (str): drugs' and their containers' info's table's path
ADNI: ANTIDIABETICOS NO INSULINICOS
The treatment of type 2 diabetes mellitus with ADNI includes a wide range of
drugs which, depending on their drugs which, according to their mechanisms of
action, can be grouped as follows:
Increased endogenous insulin sensitivity:
o Biguanides: metformin (MET).
o Thiazolidinediones: pioglitazone (PIO).
Increased endogenous insulin secretion/release:
o Sulfonylureas (SU).
o Meglitinides: repaglinide (REP)
Reduction of digestive glucose absorption:
o Alpha-glucosidase inhibitors.
o Vegetable fiber and derivatives.
Incretin effect enhancers.
o Inhibitors of DPP-4 (iDPP-4).
o GLP-1 analogues (aGLP-1).
Inhibitors of renal glucose reuptake
o SGLT-2 inhibitors (iSGLP-2)
'''
treat_min_days = 2
days_error = 60-treat_min_days
days_after_m = 7
fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
WHERE type = '{cond}'").df()
dx_date = con.sql(f"SELECT patient_id, dx_date FROM patient_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}')").df()
dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
fin_date = dict(zip(fin_date.patient_id,fin_date.date))
code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
).get('class','NONE2'
).replace('+','&')
treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND substring(atc_code,1,3) = 'A10' \
AND dispensing_date IS NOT NULL").df()
treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])]
nac_dict = nac2comprimidos(nac_path,set(treat_df['nac_code']))
treat_df['nac_code'] = treat_df['nac_code'].apply(
lambda x: nac_dict.get(x,days_error))
treat_df['containers_number'] = treat_df['containers_number'].fillna(1)
treat_df['nac_code'] = treat_df['nac_code']*treat_df['containers_number']
treat_df = treat_df[['patient_id','dispensing_date','Event','nac_code']
].rename(columns = {'dispensing_date':'date'}
).sort_values(by = ['patient_id','date' ],
ascending = [True, True])
if cond in ['cvd','hf','ckd','f']:
df = con.sql(f"SELECT *, 'start' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
FROM (SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {8} THEN 'HBA<L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date < turn_to_65 THEN 'HBA>L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date >= turn_to_65 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
AND param_date IS NOT NULL\
UNION ALL SELECT * FROM treat_df) \
UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
FROM (SELECT patient_id,param_date AS date, param_date,\
CASE WHEN param_value < {8} THEN 'HBA<L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date < turn_to_65 THEN 'HBA>L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date >= turn_to_65 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
AND param_date IS NOT NULL\
UNION ALL SELECT patient_id,\
CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
DAYS AS date, date AS param_date, \
Event, nac_code FROM treat_df)").df()
else:
df = con.sql(f"SELECT *, 'start' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
FROM (SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {7} THEN 'HBA<L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date < turn_to_75 THEN 'HBA>L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date >= turn_to_75 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
UNION ALL SELECT * FROM treat_df) \
UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
FROM (SELECT patient_id,param_date AS date, param_date,\
CASE WHEN param_value < {7} THEN 'HBA<L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date < turn_to_75 THEN 'HBA>L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date >= turn_to_75 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
UNION ALL SELECT patient_id,\
CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
DAYS AS date, date AS param_date, \
Event, nac_code FROM treat_df)").df()
patient_list = list(set(df['patient_id']))
df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))]
df = df.sort_values(by = ['patient_id', 'actins','cycle'],
ascending = [True, True, False])
df.index = range(len(df))
#####################################################################
event_log = dict()
event_log['patient_id'] = []
event_log['date'] = []
event_log['nid'] = []
event_log['Event'] = []
event_log['cycle'] = []
id_list = list(set(df['patient_id']))
events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
row = len(events)
no_hba = [i for i in range(row) if i not in [hba0,hba1]]
for id in tqdm(id_list):
df_id = df[df['patient_id']==id]
nid = id_list.index(id)
actins = set(df_id['actins'])
date_min = min([min(set(df_id['date'])),dx_date[id]])
date_max = max(set(df_id['date']))
if str(fin_date[id])!='nan':
date_max = max([date_max,fin_date[id]])
else:
date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
dd = [date_min + timedelta(days=x) for x in range(
(date_max-date_min).days + 1)]
col = len(dd)
ev_status = np.zeros((row,col))
max_nac = int(max(df_id['nac_code'])*1.2)
#event matrix where columns are days and rows treatments
#0:no treatment, 1:treatment, 2:posible treatment,
#3:after measuring period
for act in actins:
ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]
ev = list(df_id['Event'][df_id['actins']==act])[0]
fin = ini+timedelta(days=treat_min_days) if 'HBA' not in ev else ini
ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(ini),
dd.index(fin)+1)),1)
if 'HBA' not in ev:
possible_day_duration = int(list(
df_id['nac_code'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]*1.2)
until_day = min([dd.index(ini)+possible_day_duration,col])
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(fin),
until_day)),2)
measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
ev_status[hba1,:].nonzero()[0])))
#event compaction.
#Example: If a patient has 'A' treatment and they are dispensing 'A'
# drug multiple times, their trace could appear as A>A>A>A,
# but the true trace should appear as 'A' treatment lasting
# its corresponding time. Therefore, if the difference between
# the last day and the first day of two consecutive same drugs's
# dispensation is less than the number of tablets of the
# container two events are jointed in one with the first day of
# the first event as initial date and last day of the last
# event as the end date.
ev_status = ev_status.astype(int)
for i in no_hba:
seq = str(list(ev_status[i,:]))
for delta_days in range(1,max_nac):
pattern = ', '+str([1]+[2]*delta_days+[1])[1:-1]
new_pattern = ', '+str([1]+[1]*delta_days+[1])[1:-1]
seq = seq.replace(pattern,new_pattern)
ev_status[i,:] = eval(seq)
ev_status = change_matrix_values(ev_status,no_hba,measures,3)
for i,j in list(itertools.combinations(no_hba,2)):
seq_i = str(list(ev_status[i,:]))
seq_j = str(list(ev_status[j,:]))
for delta_days in range(1,max_nac):
pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
if re.search(pattern_i,seq_i)==None:
continue
pattern_j = ', '+str([1]+[2]*delta_days+[3])[1:-1]
new_pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
seq_j = seq_j.replace(pattern_j,new_pattern_j)
for delta_days in range(1,max_nac):
pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
if re.search(pattern_j,seq_j)==None:
continue
pattern_i = ', '+str([1]+[2]*delta_days+[3])[1:-1]
new_pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
seq_i = seq_i.replace(pattern_i,new_pattern_i)
ev_status[i,:] = eval(seq_i)
ev_status[j,:] = eval(seq_j)
measures_w_dp = [[i+d_p for d_p in range(days_after_m+1)
if i+d_p<col] for i in measures]
measures_w_dp = list(itertools.chain(*measures_w_dp))
ev_status = change_matrix_values(ev_status,no_hba,measures_w_dp,3)
for i in no_hba:
seq = str(list(ev_status[i,:]))
seq = seq.replace('2','0').replace('3','0')
ev_status[i,:] = eval(seq)
col = dd.index(dx_date[id])
col_0 = dd.index(dx_date[id])
col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
min_change_days = 42
while col<col_max-1:
if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
ev = col2treat(ev_status,events,col_0)
if (ev=='_' and col-col_0<min_change_days):
col+=1
col_0 = col
continue
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
col+=1
col_0 = col
else:
col+=1
ev = col2treat(ev_status,events,col_0)
if ev!='_' or col-col_0>=min_change_days:
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
evlog = pd.DataFrame.from_dict(event_log)
evlog = evlog.sort_values(['patient_id','date'])
evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
evlog.index = range(len(evlog))
evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
evlog['actins'] = [n//2 for n in range(len(evlog))]
con.sql(f"CREATE OR REPLACE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
def change_matrix_values(matrix,list_rows,list_cols, new_value=0):
'''change selected rows' and columns' matrix`s values
Args:
matrix (array): matrix wanted to modify
list_rows (list): matrix's rows wanted to modify
list_cols (list): matrix's columns wanted to modify
new_value (int): new wanted value
Output:
matrix (array): modified matrix
'''
pos = list(itertools.product(list_rows,list_cols))
for i,j in pos:
matrix[i,j] = new_value
return matrix
def col2treat(matrix,rows,col):
'''translate treatments from a binary vector
Args:
matrix (array): events calendary in binary information
rows (str): events list
col (int): column of matrix wanted to translate
Output:
treatment (str):
'''
treatment = '+'.join(sorted([rows[ev
] for ev in range(len(rows))
if matrix[ev,col]!=0]))
if treatment!='':
return treatment
else:
return '_'
def nac2comprimidos(nac_path,code_set):
'''Creating dictionary of nac drugs container's number of tablets
Args:
nac_path (str): nac table's path
code_set (set): nac code list
Outputs:
nac_dict (dic): nac codes as keys and their number of tablets as values
'''
nac = pd.read_csv(nac_path)
nac['Código Nacional'] = nac['Código Nacional'].astype(str)
nac_dict = {}
for n in range(len(nac)):
if nac['Código Nacional'][n] in code_set:
text = nac['Nombre del producto farmacéutico'][n].lower()
if re.search(r'\d+(?=\s*comprim)', text)!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'\d+(?=\s*comprim)', text).group())
elif re.search(r'\d+(?=\s*comprim)',
re.sub(r'\([^()]*\)', '', text))!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'\d+(?=\s*comprim)',
re.sub(r'\([^()]*\)', '', text)).group())
elif re.search(r'con pelicula (\d+)', text)!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'con pelicula (\d+)',
text).group().split()[-1])
return nac_dictProcess indicators’ event log creation
With the objective of studying process indicators a function to make a process indicators’ event log is coded above. There, in addition to laboratory measures and realized exams, three artificial events have been included to aid the analysis.’INI’ event denotes each patient’s diabetes diagnosis date, ‘yFIN’ events indicate the end of each annual interval and ‘FIN’ event is the date of the end of the last completed annual interval.
Code
def evlog_process_ind(con):
'''Generation of process indicators' eventlog
Args:
con : db connector variable
'''
con.sql("CREATE OR REPLACE VIEW pre_process_ind AS SELECT *,'start' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id,date) AS actins \
FROM (SELECT DISTINCT patient_id,'hba1c' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 60 SECOND AS date \
FROM param_incidents_view WHERE param_name = 'hba1c' \
UNION ALL SELECT DISTINCT patient_id,'col' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 120 SECOND AS date \
FROM param_incidents_view WHERE param_name IN ('col','hdl','ldl') \
UNION ALL SELECT DISTINCT patient_id,'blood_p' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 180 SECOND AS date \
FROM param_incidents_view WHERE param_name IN ('sbp','dbp') \
UNION ALL SELECT DISTINCT patient_id,'indalbcr' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 240 SECOND AS date \
FROM param_incidents_view WHERE param_name='indalbcr' \
UNION ALL SELECT DISTINCT patient_id,'filtglom' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 300 SECOND AS date \
FROM param_incidents_view WHERE param_name='filtglom' \
UNION ALL SELECT DISTINCT patient_id,'bmi' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 360 SECOND AS date \
FROM param_incidents_view WHERE param_name IN ('bmi','weight') \
UNION ALL SELECT DISTINCT patient_id,'ocular_exam' AS event, \
CAST(test_date AS TIMESTAMP) + INTERVAL 420 SECOND AS date \
FROM exams_incidents_view WHERE test_name='ocular_exam_PC' \
UNION ALL SELECT DISTINCT patient_id,'foot_exam' AS event, \
CAST(test_date AS TIMESTAMP) + INTERVAL 480 SECOND AS date \
FROM exams_incidents_view WHERE test_name='foot_exam' \
UNION ALL SELECT patient_id,'INI' AS event, \
CAST(dx_date AS TIMESTAMP) AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 365 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 730 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 1095 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 1461 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 1826 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'FIN' AS event,\
CAST(COALESCE(deregistration_date,'2023-01-01') AS TIMESTAMP) AS date \
FROM patient_incidents_view)")
con.sql("CREATE OR REPLACE TABLE process_ind AS SELECT a.* FROM pre_process_ind a \
JOIN (SELECT patient_id, MIN(date) as ini_date,MAX(date) as fin_date \
FROM pre_process_ind d WHERE event IN ('INI', 'yFIN') AND \
date<(SELECT c.date FROM pre_process_ind c \
WHERE c.event = 'FIN' AND c.patient_id = d.patient_id) \
GROUP BY patient_id) b ON a.patient_id = b.patient_id \
WHERE (a.date BETWEEN b.ini_date AND b.fin_date) OR a.event = 'FIN'")
con.sql("CREATE OR REPLACE TABLE process_ind1 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 1 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE OR REPLACE TABLE process_ind2 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 2 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE OR REPLACE TABLE process_ind3 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 3 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE OR REPLACE TABLE process_ind4 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 4 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE OR REPLACE TABLE process_ind5 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 5 AND (A.date <= R.date OR A.event = 'FIN'))")
Distances
One of the most important aim of process mining is to show and explain the processes. However, the great variety of traces does not allow us to draw any clear conclusions and it is often necessary to simplify our data. Another option that we can do before simplifying, to avoid the excessive losing of information and give another perspective to the analysis, is to cluster the traces and to analyze them by cluster. To do that we have to measure somehow the differences between the traces; the distance between them. There are some distances that we can use to this task: edit distances, vector term similarity, LDA based distances, embedding based distances… Some of them are shown below as functions to calculate the distance matrix of traces:
Code
####### EDIT DISTANCE #######
def calculate_dm_ED(traces,measure_f):
'''Calculate distance matrix with some edit distance.
Args:
traces (list): patients' traces
measure_f: some edit distance function
Returns:
dm: distance matrix
'''
id2word = corpora.Dictionary(traces)
traces_e = [[id2word.token2id[t[n]] for n in range(len(t))] for t in traces]
traces_e_str = list(set([str(traces_e[i]
) for i in range(len(traces_e))]))
len_t_r = len(traces_e_str)
len_t = len(traces_e)
dm = np.zeros((len_t,len_t), dtype = np.float32)
same = measure_f(traces_e[0],traces_e[0])
d_dic = {str(t):dict() for t in traces_e_str}
for i in range(len_t_r):
d_dic[traces_e_str[i]][traces_e_str[i]] = same
for j in range(i+1,len_t_r):
d_ij = measure_f(eval(traces_e_str[i]),
eval(traces_e_str[j]))
d_dic[traces_e_str[i]][traces_e_str[j]] = d_ij
d_dic[traces_e_str[j]][traces_e_str[i]] = d_ij
for i in range(len_t):
dm[i][i] = same
for j in range(i+1,len_t):
t_i = str(traces_e[i])
t_j = str(traces_e[j])
d_ij = d_dic[t_i][t_j]
dm[i][j] = d_ij
dm[j][i] = d_ij
if same == 1:
dm = 1 - dm
return dm
####### TERM VECTOR SIMILARITY #######
def calculate_dm_TV(traces):
'''Calculate distance matrix with term vector similarity.
Args:
traces (list): list of traces
Returns:
dm (array): distance matrix
vectorizer: TfidfVectorizer
X: traces vectorized with TfidVectorizer
'''
corpus = [' '.join(t) for t in traces]
vectorizer = TfidfVectorizer(tokenizer=str.split)
X = vectorizer.fit_transform(corpus)
print('calculatin dm ...')
dm = np.asarray(np.matmul(X.todense(),X.todense().T))
dm = 1 - dm.round(decimals=4)
return dm, vectorizer, X
####### LDA BASED DISTANCE #######
def calculate_dm_LDA(traces,T=10):
'''Calculate distance matrix with LDA model.
Args:
traces (list): list of traces
T (int): number of topics of LDA model
Returns:
dm (array): distance matrix
lda_model: LdaModel
id2word (dict): tokenized events as keys and events by values
'''
# Create Dictionary
id2word = corpora.Dictionary(traces)
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in traces]
# Make LDA model
lda_model = LdaModel(corpus=corpus,
id2word=id2word,
num_topics=T,
alpha = 1,
eta = 'auto',
random_state = 123)
get_c_topic = np.array(
lda_model.get_document_topics(corpus,minimum_probability = -0.1))
sigma = np.asarray([[get_c_topic[i][j][1]
for j in range(T)] for i in range(len(corpus))])
sigma2 = np.asarray(np.matmul(sigma,sigma.T))
len_t = len(traces)
dm = np.zeros((len_t,len_t), dtype = np.float32)
same = sigma2[0][0]/np.sqrt(sigma2[0][0]*sigma2[0][0])
for i in trange(len_t):
dm[i][i] = same
for j in range(i+1,len_t):
d_ij = sigma2[i][j]/np.sqrt(sigma2[i][i]*sigma2[j][j])
dm[i][j] = d_ij
dm[j][i] = d_ij
dm = 1-dm
return dm, lda_model, id2wordClustering
Once the distance matrix is obtained, we can proceed with the clustering. Each clustering method has its characteristics and its peculiarities. For example, we have to consider we have a distance matrix and not a data frame in which we apply directly the method. A hierarchical clustering algorithm seems to be a good choice in our case because in addition to the above it allows to choose the optimal number of clusters.
The next code box contains two different functions to choose the optimal number of clusters:
Code
def dendogram(dm,output_png='../../outputs/dendogram.png'):
'''Plot and save dendogram.
Args:
dm (array): distance matrix
output_png (str): saved dendogram's path
'''
dm_condensed = squareform(dm)
matrix = linkage(
dm_condensed,
method='complete'
)
sys.setrecursionlimit(10000000)
dn = dendrogram(matrix,truncate_mode='lastp',p=80)
sys.setrecursionlimit(1000)
plt.title('Dendrogram')
plt.ylabel('Distance')
plt.xlabel('Patients traces')
plt.savefig(output_png)
plt.clf()
def kelbow(dm,
cond,
elbow_metric='distortion',
locate_elbow=False,
output_path='../../outputs/'):
'''Plots to choose optimal clusters.
Args:
dm (array): distance matrix
cond (str): predominal clinical condition's code
elbow_metric (str): name of the method
locate_elbow (boolean): True if want to return optimal number of clusters
Returns:
k_opt (int)(optional): optimal number of clusters according to method
'''
model = AgglomerativeClustering(metric = "precomputed",
linkage = 'complete')
# k is range of number of clusters.
visualizer_ = KElbowVisualizer(model,
k=(2,25),
timings=False,
xlabel = 'cluster numbers',
metric=elbow_metric,
locate_elbow=locate_elbow)
# Fit data to visualizer
output_file = output_path+elbow_metric+'_'+cond+'.png'
visualizer_.fit(dm)
# Finalize and render figure
visualizer_.show(outpath=output_file,clear_figure=True)
k_opt=None
if locate_elbow:
k_opt = visualizer_.elbow_value_
return k_optThe function ‘clust’ clusterizes traces in prefixed number of clusters:
Code
def clust(clust_n,dist_matrix,df_,id2trace,patients):
'''clusterize distance matrix.
Args:
clust_n (int): number of clusters obtained
dist_matrix (array): distance matrix
df_ (dataframe): event log
id2trace (dict): patient ids as keys and their traces as values
patients (list): patients' ids in same order as in dm
Returns:
df_ (dataframe): dataframe with patients and their clusters
'''
traces = list(id2trace[id] for id in sorted(id2trace.keys()))
model = AgglomerativeClustering(n_clusters=clust_n,
metric = "precomputed",
linkage = 'complete')
model.fit(dist_matrix)
labels = model.labels_
cluster_list ={id: labels[traces.index(id2trace[id])
] for id in patients}
df_['cluster'] = [cluster_list[df_['ID'][i]] for i in range(len(df_))]
return df_Descriptive analysis of treatments’ eventlog
The implementation below is made to show the most frequent traces in each cluster:
Code
def make_data_dict(log,top_k,col_id):
'''Obtain most frequent traces and their statistics
Args:
log (dataframe): event log
top_k (int): number of traces want to show
col_id (str): patients id column's name in df_
Returns:
data_dict (dict): traces as keys and ther statistics as values
'''
len_id = len(set(log[col_id]))
log_freq = pm4py.stats.get_variants(log)
freq_list = [(t,log_freq[t],len(t)) for t in set(log_freq.keys())]
trace = [list(t[0]) for t in sorted(freq_list, key=lambda x:
(len_id-x[1],x[2]))[:top_k]]
cases = [t[1] for t in sorted(freq_list, key=lambda x:
(len_id-x[1],x[2]))[:top_k]]
top_k = min(top_k,len(cases))
percentage = [100*cases[c]/len_id for c in range(top_k)]
cum_percentage = [sum(percentage[:p+1]) for p in range(top_k)]
data_dict = {"Trace": trace,
"Percentage": percentage,
"Cases": cases,
"Cumulative Percentage": cum_percentage}
return data_dict
def update_color_dict(color_dict,data_dict):
'''update of the color dict to include new events
Args:
color_dict (dict): events as keys and colors as values
data_dict (dict): traces as keys and ther statistics as values
Returns:
color_dict (dict): events as keys and colors as values
'''
cmap = plt.cm.get_cmap('tab20')
for event in set(itertools.chain.from_iterable(data_dict['Trace'])):
if event not in color_dict and len(color_dict)==20:
cmap = plt.cm.get_cmap('tab20b')
if event not in color_dict:
try:
color_dict.update({event:cmap(len(color_dict))})
except:
color_dict.update({event:cmap(2*(len(color_dict)-20))})
return color_dict
def trace_plotter(data_dict,
color_dict,
acronym,
output_file,
font_size=10,
percentage_box_width=0.8,
size=(15,9)):
'''configuration of the trace_explorer plot
Args:
color_dict (dict): events as keys and colors as values
data_dict (dict): traces as keys and their statistics as values
acronym (dict): events as keys and their acronyms as values
output_file (str): figure's path
font_size (int): font size
percentage_box_width (float): event boxes' width
size (tuple): figure's size
'''
fig, ax = plt.subplots(figsize=size)
percentage_position = max(len(t) for t in data_dict["Trace"]
) + percentage_box_width*3 +0.5
for row, (trace, percentage,cases,cum_percentage
) in enumerate(zip(data_dict["Trace"],
data_dict["Percentage"],
data_dict["Cases"],
data_dict["Cumulative Percentage"]),
start=1):
for col, acr in enumerate(trace, start=1):
ax.add_patch(plt.Rectangle((col - 0.5, row - 0.45), 1, 0.9,
facecolor=color_dict[acr],
edgecolor='white'))
ax.text(col,
row,
acr,
ha='center',
va='center',
color='white',
fontsize = font_size,
fontweight='bold')
ax.add_patch(plt.Rectangle((
percentage_position -percentage_box_width*2.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position-percentage_box_width*2,
row,
f'{percentage:.2f}%',
ha='center',
va='center',
color='white',
fontsize = font_size+2)
ax.add_patch(plt.Rectangle((
percentage_position - percentage_box_width*1.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position-percentage_box_width,
row,
f'{cases}',
ha='center',
va='center',
color='white',
fontsize = font_size+4)
ax.add_patch(plt.Rectangle((percentage_position-percentage_box_width*0.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position,
row,
f'{cum_percentage:.2f}%',
ha='center',
va='center',
color='white',
fontsize = font_size+2)
ax.set_xlim(0.5, percentage_position+0.5)
ax.set_xticks(range(1, int(percentage_position-1)))
ax.set_ylabel('Traces',fontsize = font_size+3)
ax.set_ylim(len(data_dict["Trace"]) + 0.45, 0.55) # y-axis is reversed
ax.set_yticks([])
ax.set_xlabel('Activities',fontsize = font_size+3)
handles = [plt.Rectangle((0, 0), 0, 0, facecolor=color_dict[acr],
edgecolor='black', label=acronym[acr])
for acr in acronym if acr in set(
itertools.chain.from_iterable(data_dict['Trace']))]
ax.legend(handles=handles,
bbox_to_anchor=[1.02, 1.02],
loc='upper left',
fontsize = font_size+6)
for dir in ['left', 'right', 'top']:
ax.spines[dir].set_visible(False)
plt.tight_layout()
plt.savefig(output_file)
plt.close()
def trace_explorer(con,
cond,
top_k=5,
id_col='ID',
ev_col='Event',
date_col='date',
clust_col='cluster',
color_dict={}):
'''Plot each clusters most frequent traces
Args:
con : db connector variable
cond (str): predominal clinical condition's code
top_k (int): traces as keys and their statistics as values
id_col (str): patients id column's name in evlog_file
ev_col (str): events column's name in evlog_file
date_col (str): events dates column's name in evlog_file
clust_col (str): cluster column's name in evlog_file
color_dict (dict): events as keys and colors as values
'''
log_ = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered").df(),
case_id=id_col,
activity_key=ev_col,
timestamp_key=date_col)
log_ = log_.sort_values([id_col,date_col])
log_ = log_[log_['cycle']=='start']
for clust in set(log_[clust_col]):
log = log_[log_[clust_col]==clust]
len_id = len(set(log[id_col]))
acronym = {t:t for t in sorted(set(log[ev_col]))}
data_dict = make_data_dict(log,top_k,id_col)
color_dict = update_color_dict(color_dict, data_dict)
trace_plotter(data_dict,color_dict,acronym,
'../../outputs/t_cluster_%s_%i.png' % (cond,clust))
return color_dictTo get the process maps of each cluster next R functions can be used:
Code
load_log <- function(con,
query,case_id="ID",
activity_id="Event",
lifecycle_id="cycle",
activity_instance_id="actins",
timestamp="date"){
eventlog <- dbGetQuery(con,query)
eventlog = eventlog[order(eventlog$ID),]
#To transform date to a format we can work with
eventlog$date = as.POSIXct(eventlog$date, tz = "", format="%Y-%m-%d" ,
tryFormats = c("%Y/%m/%d",
"%Y/%m/%d"),
optional = FALSE)
evLog = eventlog %>%
mutate(resource = NA) %>%
mutate(cycle = fct_recode(cycle,
"start" = "start",
"complete" = "end")) %>%
eventlog(
case_id = case_id,
activity_id = activity_id,
lifecycle_id = lifecycle_id,
activity_instance_id = activity_instance_id,
timestamp = timestamp,
resource_id = 'resource'
)
return(evLog)
}
make_process_map <- function(log,t_freq,output_file){
log %>%
filter_activity_frequency(percentage = 1) %>% # show only most frequent
filter_trace_frequency(percentage = t_freq) %>%
process_map(type_nodes = performance(mean,units='days'),
sec_nodes = frequency('relative_case'),
type_edges = frequency('absolute'),
sec_edges = frequency('relative_case'),
render = T) %>%
export_svg %>%
charToRaw %>%
rsvg_png (output_file,width=2000)
}
process_map_by_cluster <- function(evLog,t_freq,cond_code){
for (clust in unique(evLog$cluster)) {
log <- evLog %>%
filter(cluster == clust)
make_process_map(log,t_freq,gsub("src/analysis-scripts/",
"",here("outputs",sprintf(
"pm_cluster_%s_%d.png",
cond_code, clust) )))
}
}Conformance checking
Conformance checking is a technique used to check process compliance by comparing event logs for a discovered process with the existing reference model (target model) of the same process. Basing on the DM2 treatment algorithm previous shown, with a software called Carassius , we created the next Petri Nets that are going to be useful as treatment guidelines in reference to glycated hemoglobin measures.
Fitness is the metric that measures how much a trace is distanced from a given process model, or from the guidelines in this case. There are different methods to calculate this metric but in the code below is used the aligned fitness. Since in this metric the initial marking and the final marking have to be fixed we included the events ‘INI’ and ‘FIN’ in the Petri Net and in each trace. Adding this artificial events allows us to compare all traces fitness in the same conditions.
Code
def id2treat_fitness(log ,
net,
initial_marking,
final_marking,
clust_col='cluster',
date_col='date',
ev_col='Event'):
'''Obtain traces fitness
Args:
log (dataframe): event log
net: petri net
initial_marking: initial place in the petri net
final_marking: final place in the petri net
clust_col (str): cluster column's name in log
date_col (str): events dates column's name in log
ev_col (str): events column's name in log
Returns:
df (dataframe): traces, their clusters and their fitnesses
'''
id2trace = {id:list(log['Event'][log['case:concept:name']==id]
) for id in set(log['case:concept:name'])}
id2ids = {id:[id2 for id2 in set(id2trace.keys()) if id2trace[id]==id2trace[id2]
] for id in set(log['case:concept:name'])}
for id in set(id2ids.keys()):
try:
for id2 in id2ids[id]:
if id2!=id:
del id2ids[id2]
except:
continue
id2fitness = dict()
for name in set(id2ids.keys()):
log2 = log.drop(log.index[log['case:concept:name'] !=name])
new = log2.copy().iloc[[0, -1]]
date_list = list(new[date_col])
index = list(new['@@index'])
actins = list(new['actins'])
clust = new[clust_col].values[0]
new[date_col] = new['time:timestamp'] = [date_list[0]-timedelta(days=1),
date_list[1]+timedelta(days=1)]
new[ev_col] = new['concept:name'] = ['INI','FIN']
new['@@index'] = [index[0]-1,index[1]+1]
new['actins'] = [actins[0]-1,actins[1]+1]
log2 = pd.concat([log2,new]).sort_values('time:timestamp')
aligned_fitness = pm4py.conformance_diagnostics_alignments(
log2, net, initial_marking, final_marking)[0]['fitness']
for id in id2ids[name]:
id2fitness[id] = {'ID':id,
'aligned_fitness':aligned_fitness,
clust_col: clust}
df = pd.DataFrame(id2fitness).T
df.index = range(len(df))
return dfThe function below takes a treatments’ event log and returns each patient’s period’s fitness.
Code
def id2treat_fitness_by_interval(con,
cond,
pn_file,
ini_place='place100',
fin_place='place111',
date_n_col='date',
fixed_period_time=90):
'''Obtain traces fitness by intervals of fixed_period_time days
Args:
con: db connector variable
cond (str): predominal clinical condition's code
pn_file (str): petri net's path
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
date_n_col (str): events dates column's name in log
fixed_period_time (int): number of days in each interval
'''
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
event_log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust \
WHERE cycle = 'start'").df(),
case_id='ID',
activity_key='Event',
timestamp_key=date_n_col)
event_log = event_log[event_log['cycle']=='start']
baseline = datetime.strptime('2017-01-01', "%Y-%m-%d")
endline = datetime.strptime('2023-01-01', "%Y-%m-%d")
dd = [baseline + timedelta(days=x) for x in range((
endline-baseline).days + 1)]
dd_INI = dd[0]-timedelta(days=1)
dd_FIN = dd[-1]+timedelta(days=1)
event_log['time:timestamp'] = event_log['time:timestamp'].dt.tz_localize(None)
event_log[date_n_col] = event_log[date_n_col].dt.tz_localize(None)
event_log['date_'] = event_log[date_n_col].apply(
lambda x: dd.index(x))
event_log[date_n_col] = event_log[date_n_col].apply(
lambda x: dd.index(x))
event_log[date_n_col] = event_log.groupby('ID')[date_n_col].apply(
lambda x: x-x.min())
hosp = con.sql("SELECT * FROM cmbd_incidents_postdx_first_view").df()
hosp['admission_date'] = hosp['admission_date'].dt.tz_localize(None).apply(
lambda x: dd.index(x))
hosp = dict(zip(hosp.patient_id,hosp.admission_date))
period2fitness = pd.DataFrame()
id_list_df = []
p_start = []
p_end = []
fitness = []
date_0 = []
status = []
id_list = list(set(event_log['ID']))
for id in tqdm(id_list):
event_log_id = event_log[event_log['ID']==id]
hosp_d = hosp.get(id,100000)-min(event_log_id['date_'])
date_max = min(max(event_log_id[date_n_col]),hosp_d)
date_min = min(event_log_id['time:timestamp'])
n_periods = date_max//fixed_period_time
n_periods_r = date_max/fixed_period_time
if date_max<=fixed_period_time:
continue
for n in range(1,n_periods+1):
event_log_id_n = event_log_id[
event_log_id[date_n_col]<n*fixed_period_time]
new = event_log_id.copy().iloc[[0, -1]]
actins = list(new['actins'])
index = list(new['@@index'])
new['time:timestamp'] = [dd_INI,dd_FIN]
new['concept:name'] = ['INI','FIN']
new['@@index'] = [index[0]-1,index[1]+1]
new['actins'] = [actins[0]-1,actins[1]+1]
event_log_id_n = pd.concat([event_log_id_n,new]
).sort_values('time:timestamp')
aligned_fitness = pm4py.conformance_diagnostics_alignments(
event_log_id_n, net,
initial_marking, final_marking)[0]['fitness']
id_list_df.append(id)
p_start.append((n-1)*fixed_period_time)
p_end_value = n*fixed_period_time
if n==n_periods:
p_end_value = date_max-fixed_period_time
p_end.append(p_end_value)
fitness.append(aligned_fitness)
date_0.append(date_min)
if (n==n_periods or n==n_periods_r-1) and hosp_d==date_max :
status.append(1)
break
else:
status.append(0)
period2fitness['ID'] = id_list_df
period2fitness['t_0'] = p_start
period2fitness['t_1'] = p_end
period2fitness['fitness'] = fitness
period2fitness['ini_date'] = date_0
period2fitness['status'] = status
con.sql(f"CREATE OR REPLACE TABLE period2fitness_{cond} AS SELECT *,\
MAX(status) OVER (PARTITION BY ID) AS status2 FROM period2fitness \
WHERE t_0!=t_1") In the next function is shown a boxplot function to show clusters’ fitness distribution.
Code
def treatments_clusters_boxplot(con,
cond,
pn_file,
pn_png_file,
ini_place='place100',
fin_place='place111',
date_col='date',
ev_col='Event',
clust_col='cluster',
output_png='../../outputs/fitness_by_cluster.png'):
'''Barplot of the fitness of each cluster
Args:
con: db connector variable
cond (str): predominant clinical condition
pn_file: petri net's path
ini_marking: initial place in the petri net
fin_marking: final place in the petri net
date_col (str): events dates column's name in log
clust_col (str): cluster column's name in log
ev_col (str): events column's name in log
output_png (str): created figure's path
'''
log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered \
WHERE cycle = 'start'").df(),
case_id='ID',
activity_key=ev_col,
timestamp_key=date_col)
log = log.sort_values(date_col)
log.index = range(len(log))
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
pm4py.save_vis_petri_net(net, initial_marking, final_marking,pn_png_file)
df = id2treat_fitness(log,net,initial_marking,final_marking,
clust_col,date_col,ev_col)
df['aligned_fitness'] = pd.to_numeric(df['aligned_fitness'])
df['cluster'] = pd.to_numeric(df['cluster'])
df['ID'] = df['ID'].astype("string")
con.sql(f"CREATE OR REPLACE TABLE eventlog_{cond}_byclust AS \
SELECT * FROM df")
data = [list(df['aligned_fitness'][df[clust_col]==i])
for i in sorted(set(df[clust_col]))]
fig = plt.figure(figsize =(10, 7))
# Creating axes instance
ax = fig.add_axes([0, 0, 1, 1])
# Creating plot
bp = ax.boxplot(data,labels=[i for i in sorted(set(df[clust_col]))])
plt.xlabel("Clusters")
plt.ylabel("Aligned Fitness")
# show plot
plt.savefig(output_png,bbox_inches='tight')
plt.close(fig)
Process indicators can also be represented in a Petri Net. Therefore we created the next Petri Nets to analyze whether the traces obey the guidelines the first five years:
The function below takes a process indicators’ event log and returns each patient’s fitness by year. The method to calculate the fitness in this case is the token base replay method.
Code
def id2process_fitness(con,
query,
pn_file,
pn_png_file,
ini_place='place37',
fin_place='place148',
id_col = 'patient_id',
event_col = 'event',
date_col='date',):
'''Calculate the fitness of a given process indicators' event log
Args:
con: db connector variable
cond (str): predominal clinical condition's code
query (str): query to select event log
pn_file (str): petri net's path
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
id_col (str): ids' column's name in event log
event_col (str): events' column's name in event log
date_col (str): events' dates' column's name in event log
Returns:
df (dataframe): dataframe of ids' fitness
'''
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
pm4py.save_vis_petri_net(net, initial_marking, final_marking,pn_png_file)
event_log = pm4py.format_dataframe(con.sql(query).df(),
case_id=id_col,
activity_key=event_col,
timestamp_key=date_col)
date_list = []
fit_list = []
id_list = list(set(event_log[id_col]))
for id in tqdm(id_list):
log = event_log.drop(event_log.index[event_log['case:concept:name'] !=id]
).sort_values('time:timestamp')
fit_obj = pm4py.conformance.conformance_diagnostics_token_based_replay(
log, net, initial_marking, final_marking)
date_list.append(list(log['time:timestamp'])[-2])
fit_list.append(fit_obj[0]['trace_fitness'])
log_0 = log[log['concept:name'].isin(['INI','FIN','yFIN'])]
alfa = pm4py.conformance.conformance_diagnostics_token_based_replay(
log_0, net, initial_marking, final_marking)[0]['trace_fitness']
df = pd.DataFrame()
df['ID'] = id_list
df['fitness'] = [(x-alfa)/(1-alfa) for x in fit_list]
df['date'] = date_list
return dfTime dependent Cox model
The link between guideline adherence, in terms of performed process measures, and clinical outcomes is a highly demanded issue in diabetes care. A Cox model is a statistical technique for exploring the relationship between the survival of a patient and several explanatory variables. One of the strengths of the Cox model is its ability to encompass covariates that change over time, such as treatment adherence. A time dependent Cox model can be made using each patient trace’s fitness at different time interval.
Joint Latent Class Model
Joint models are used to analyse simultaneously two related phenomena, the evolution of a variable and the occurence of an event. Joint latent class models (JLCM) consist of a linear mixed model and a proportional hazard model linked by the latent classes. The population is split in several groups, the latent classes, and each class is caracterized by a specific evolution of the dependent variable and an associated risk of event. Using fitness as time dependent treatment adherence measure we can made a joint latent class model.
Results
Cohort description
Code
###INCIDENTS
patient_inc <- dbGetQuery(con,
"SELECT *,
CASE WHEN sex = 0 THEN 'Male'
ELSE 'Female' END AS sexo FROM patient_incidents_view")
patient_inc <- patient_inc %>%
left_join( country2cont %>% select(country, CC), by = "country") %>%
mutate(education=factor(education, levels=c(0,1,2,3),
labels=c("Without studies","Primary school",
"High school","University")),
copayment=factor(copayment,levels=c(0,1),
labels=c("less than 18000","more or equal than 18000")))
patient_inc$CC[patient_inc$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")
patient_inc%>%summary_factorlist(dependent,
explanatory,
total_col = TRUE,
cont="mean",
cont_cut=7,
na_include = TRUE,
na_to_prop = FALSE,
include_col_totals_percent =FALSE,
add_col_totals = TRUE)-> patient_inc_tab
kable(patient_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Female | Male | Total |
|---|---|---|---|---|
| Total N | 2728 | 2516 | 5244 | |
| age | Mean (SD) | 78.6 (13.4) | 78.6 (13.2) | 78.6 (13.3) |
| age_dx | Mean (SD) | 80.4 (13.5) | 80.4 (13.3) | 80.4 (13.4) |
| CC | AS | 1364 (50.0) | 1258 (50.0) | 2622 (50.0) |
| EU | 682 (25.0) | 629 (25.0) | 1311 (25.0) | |
| OC | 682 (25.0) | 629 (25.0) | 1311 (25.0) | |
| copayment | less than 18000 | 892 (49.4) | 904 (52.4) | 1796 (50.9) |
| more or equal than 18000 | 912 (50.6) | 820 (47.6) | 1732 (49.1) | |
| (Missing) | 924 | 792 | 1716 | |
| education | Without studies | 0 (NaN) | 0 (NaN) | 0 (NaN) |
| Primary school | 0 (NaN) | 0 (NaN) | 0 (NaN) | |
| High school | 0 (NaN) | 0 (NaN) | 0 (NaN) | |
| University | 0 (NaN) | 0 (NaN) | 0 (NaN) | |
| (Missing) | 2728 | 2516 | 5244 | |
| avg_income | (Missing) | 2728 | 2516 | 5244 |
Code
param_inc <- dbGetQuery(con,"SELECT * FROM param_incidents_predx_last_view")
param_list <- sort(unique(param_inc$param_name))
param_cat_inc <- dbGetQuery(con,
"SELECT *, \
CASE WHEN param_cat_name='physical_activity' AND param_cat_value=0 \
THEN 'Incapacity' \
WHEN param_cat_name='physical_activity' AND param_cat_value=1 \
THEN 'Inactive' \
WHEN param_cat_name='physical_activity' AND param_cat_value=2 \
THEN 'Partially Active' \
WHEN param_cat_name='physical_activity' AND param_cat_value=3 \
THEN 'Active' \
WHEN param_cat_name='smoking_status' AND param_cat_value=0 \
THEN 'Non Smoker' \
WHEN param_cat_name='smoking_status' AND param_cat_value=1 \
THEN 'Ex-smoker < 1 year' \
WHEN param_cat_name='smoking_status' AND param_cat_value=2 \
THEN 'Ex-smoker >= 1 year' \
WHEN param_cat_name='smoking_status' AND param_cat_value=3 \
THEN 'Smoker' \
WHEN param_cat_name='alcohol' AND param_cat_value=0 \
THEN 'Abstinent' \
WHEN param_cat_name='alcohol' AND param_cat_value=1 \
THEN 'Moderate Drinker' \
WHEN param_cat_name='alcohol' AND param_cat_value=2 \
THEN 'Heavy Drinker' \
WHEN param_cat_name='vaccination_flu' AND param_cat_value=0 \
THEN 'No' \
WHEN param_cat_name='vaccination_flu' AND param_cat_value=1 \
THEN 'Yes' \
WHEN param_cat_name='working_status' AND param_cat_value=0 \
THEN 'Active' \
WHEN param_cat_name='working_status' AND param_cat_value=1 \
THEN 'Unemployed' \
WHEN param_cat_name='working_status' AND param_cat_value=2 \
THEN 'Pensionist' \
ELSE '' END AS param_cat_value_str
FROM param_cat_incidents_predx_last_view WHERE param_cat_name!='vaccination_covid'")
param_cat_list <- sort(unique(param_cat_inc$param_cat_name))
dependent="sexo"
param_inc2unnest <- param_inc %>%
pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id")
param_cat_inc2unnest <- param_cat_inc %>%
pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value_str)%>%
left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id")
param_inc_tab <- rbind(
param_inc2unnest %>%
summary_factorlist(dependent ,
param_list,
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,add_row_totals = TRUE,
include_row_missing_col = TRUE,
include_col_totals_percent =FALSE,add_col_totals = TRUE),
param_cat_inc2unnest %>%
summary_factorlist(dependent ,
param_cat_list,
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,cont_cut=8,add_row_totals = TRUE,
include_row_missing_col = TRUE))
kable(param_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | Total N | Missing N | levels | Female | Male | Total |
|---|---|---|---|---|---|---|
| Total N | 2728 | 2516 | 5244 | |||
| alb | 2664 (50.8) | 2580 | Mean (SD) | 4.1 (0.5) | 4.0 (0.5) | 4.0 (0.5) |
| bmi | 2684 (51.2) | 2560 | Mean (SD) | 36.1 (10.1) | 36.0 (10.0) | 36.1 (10.1) |
| col | 2652 (50.6) | 2592 | Mean (SD) | 206.4 (60.4) | 214.2 (63.1) | 210.3 (61.8) |
| creat | 2756 (52.6) | 2488 | Mean (SD) | 1.7 (0.7) | 1.7 (0.8) | 1.7 (0.7) |
| dbp | 2576 (49.1) | 2668 | Mean (SD) | 77.5 (15.9) | 78.0 (15.6) | 77.7 (15.7) |
| filtglom | 2660 (50.7) | 2584 | Mean (SD) | 68.4 (30.1) | 68.2 (30.0) | 68.4 (30.0) |
| hba1c | 2716 (51.8) | 2528 | Mean (SD) | 8.3 (2.0) | 8.6 (2.0) | 8.5 (2.0) |
| hdl | 2664 (50.8) | 2580 | Mean (SD) | 57.4 (21.2) | 59.9 (21.6) | 58.7 (21.4) |
| height | 2728 (52.0) | 2516 | Mean (SD) | 1.7 (0.2) | 1.7 (0.2) | 1.7 (0.2) |
| indalbcr | 2624 (50.0) | 2620 | Mean (SD) | 591.0 (345.9) | 566.3 (330.6) | 579.0 (338.7) |
| ldl | 2692 (51.3) | 2552 | Mean (SD) | 126.1 (51.5) | 127.4 (50.8) | 126.7 (51.2) |
| sbp | 2672 (51.0) | 2572 | Mean (SD) | 140.2 (27.5) | 139.5 (27.6) | 139.8 (27.6) |
| trgl | 2676 (51.0) | 2568 | Mean (SD) | 240.2 (115.7) | 239.3 (111.1) | 239.8 (113.5) |
| weight | 2660 (50.7) | 2584 | Mean (SD) | 122.0 (46.3) | 123.9 (46.7) | 122.9 (46.5) |
| alcohol | 1588 (35.0) | 2944 | Abstinent | 272 (32.1) | 244 (33.0) | 516 (32.5) |
| Heavy Drinker | 312 (36.8) | 252 (34.1) | 564 (35.5) | |||
| Moderate Drinker | 264 (31.1) | 244 (33.0) | 508 (32.0) | |||
| family_history_CVD | 1816 (40.1) | 2716 | 972 (100.0) | 844 (100.0) | 1816 (100.0) | |
| physical_activity | 1668 (36.8) | 2864 | Active | 264 (29.3) | 204 (26.6) | 468 (28.1) |
| Inactive | 152 (16.9) | 192 (25.0) | 344 (20.6) | |||
| Incapacity | 248 (27.6) | 208 (27.1) | 456 (27.3) | |||
| Partially Active | 236 (26.2) | 164 (21.4) | 400 (24.0) | |||
| smoking_status | 1748 (38.6) | 2784 | Ex-smoker < 1 year | 188 (20.4) | 164 (19.8) | 352 (20.1) |
| Ex-smoker >= 1 year | 260 (28.3) | 232 (28.0) | 492 (28.1) | |||
| Non Smoker | 204 (22.2) | 180 (21.7) | 384 (22.0) | |||
| Smoker | 268 (29.1) | 252 (30.4) | 520 (29.7) | |||
| vaccination_flu | 1724 (38.0) | 2808 | Yes | 940 (100.0) | 784 (100.0) | 1724 (100.0) |
| working_status | 1616 (35.7) | 2916 | Active | 328 (38.9) | 256 (33.2) | 584 (36.1) |
| Pensionist | 244 (28.9) | 328 (42.5) | 572 (35.4) | |||
| Unemployed | 272 (32.2) | 188 (24.4) | 460 (28.5) |
Code
patient_inc$deregistration_date <- patient_inc$deregistration_date %>%
replace_na(as.Date('2023-01-01'))
patient_inc <- patient_inc %>%
mutate(follow_up_time=as.numeric(difftime(deregistration_date,
dx_date,units="days")/365.25)) %>%
filter(follow_up_time!=0)
ss_use_inc <- dbGetQuery(con," SELECT * FROM main.ss_use \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01') AND \
visit_date>= (SELECT dx_date FROM main.patient WHERE \
main.patient.patient_id = main.ss_use.patient_id)") %>% mutate(
visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
labels=c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional")))
ss_use_inc1 <- ss_use_inc %>%
filter(visit_loc == 1)
use_inc_freq1 <- as.data.frame.matrix(table(ss_use_inc1$patient_id,ss_use_inc1$visit_type))
use_inc_freq1 <- cbind(patient_id = rownames(use_inc_freq1), use_inc_freq1)
use_inc_freq1_time <- left_join(use_inc_freq1, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_inc2 <- ss_use_inc %>%
filter(visit_loc == 2)
use_inc_freq2 <- as.data.frame.matrix(table(ss_use_inc2$patient_id,ss_use_inc2$visit_type))
use_inc_freq2 <- cbind(patient_id = rownames(use_inc_freq2), use_inc_freq2)
use_inc_freq2_time <- left_join(use_inc_freq2, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_inc3 <- ss_use_inc %>%
filter(visit_loc == 3)
use_inc_freq3 <- as.data.frame.matrix(table(ss_use_inc3$patient_id,ss_use_inc3$visit_type))
use_inc_freq3 <- cbind(patient_id = rownames(use_inc_freq3), use_inc_freq3)
use_inc_freq3_time <- left_join(use_inc_freq3, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
use_inc_tab <- rbind(
rbind(data.frame(label = "at care center", levels = '',
Male = '', Female = '', Total = ''),
use_inc_freq1_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "at home", levels = '',
Male = '', Female = '', Total = ''),
use_inc_freq2_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "by telephone or similar", levels = '',
Male = '', Female = '', Total = ''),
use_inc_freq3_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)))
kable(use_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Male | Female | Total |
|---|---|---|---|---|
| at care center | ||||
| PC_physician | Mean (SD) | 0.6 (5.8) | 0.6 (4.6) | 0.6 (5.2) |
| PC_nurse | Mean (SD) | 1.3 (17.6) | 0.5 (5.5) | 0.9 (12.9) |
| PC_social_worker | Mean (SD) | 0.6 (6.0) | 0.3 (1.3) | 0.5 (4.2) |
| PC_emergency_service | Mean (SD) | 0.6 (8.4) | 0.4 (2.7) | 0.5 (6.1) |
| PC_others | Mean (SD) | 0.4 (2.1) | 0.3 (1.5) | 0.3 (1.8) |
| Specialized_visit_physician | Mean (SD) | 0.4 (2.1) | 0.4 (2.3) | 0.4 (2.2) |
| Specialized_visit_nurse | Mean (SD) | 0.4 (2.4) | 0.3 (1.3) | 0.3 (1.9) |
| Specialized_visit_unknown_professional | Mean (SD) | 1.0 (16.7) | 0.4 (2.6) | 0.7 (11.7) |
| at home | ||||
| PC_physician | Mean (SD) | 0.6 (3.8) | 0.2 (1.2) | 0.4 (2.8) |
| PC_nurse | Mean (SD) | 1.0 (17.3) | 0.4 (1.5) | 0.7 (11.9) |
| PC_social_worker | Mean (SD) | 0.5 (5.9) | 0.3 (1.5) | 0.4 (4.2) |
| PC_emergency_service | Mean (SD) | 0.7 (8.7) | 0.8 (6.6) | 0.7 (7.6) |
| PC_others | Mean (SD) | 0.3 (1.0) | 0.5 (3.1) | 0.4 (2.4) |
| Specialized_visit_physician | Mean (SD) | 0.2 (1.1) | 0.9 (7.0) | 0.6 (5.1) |
| Specialized_visit_nurse | Mean (SD) | 1.1 (17.3) | 0.4 (1.9) | 0.7 (11.9) |
| Specialized_visit_unknown_professional | Mean (SD) | 0.3 (1.4) | 0.6 (4.8) | 0.5 (3.7) |
| by telephone or similar | ||||
| PC_physician | Mean (SD) | 0.4 (3.0) | 0.4 (2.0) | 0.4 (2.5) |
| PC_nurse | Mean (SD) | 0.3 (0.8) | 0.6 (3.3) | 0.4 (2.4) |
| PC_social_worker | Mean (SD) | 1.1 (17.2) | 0.5 (3.4) | 0.8 (12.1) |
| PC_emergency_service | Mean (SD) | 0.4 (2.7) | 0.6 (2.5) | 0.5 (2.6) |
| PC_others | Mean (SD) | 0.7 (6.4) | 0.5 (5.6) | 0.6 (6.0) |
| Specialized_visit_physician | Mean (SD) | 0.3 (2.1) | 0.3 (1.3) | 0.3 (1.7) |
| Specialized_visit_nurse | Mean (SD) | 1.1 (17.2) | 1.0 (10.2) | 1.1 (14.0) |
| Specialized_visit_unknown_professional | Mean (SD) | 0.5 (5.8) | 0.5 (5.6) | 0.5 (5.7) |
Code
patient_condition <- dbGetQuery(con,
"SELECT patient_id,type AS initial_status,code, \
CASE WHEN code='end' THEN type \
WHEN code='bmi<30' THEN 'else' \
WHEN code IN ('bmi>=30','E66.01','E66.09','E66.1','E66.2','E66.8','E66.9') \
THEN 'ob' \
WHEN code='f_date' THEN 'f' \
WHEN code IN ('N18','N18.1','N18.2','N18.3','N18.30','N18.31','N18.32',
'N18.4','N18.5','N18.6','N18.9','I12','I12.0','I12.9','filtglom') THEN 'ckd' \
WHEN code IN ('I09.81', 'I11.0', 'I13.0', 'I13.2', 'I50', 'I50.0',
'I50.1', 'I50.9') THEN 'hf' \
WHEN code='death' THEN code \
ELSE 'cvd' END AS final_status, \
DATEDIFF('day',dx_date,date) AS duration \
FROM (SELECT pc.*,pi.dx_date \
FROM patient_condition pc LEFT JOIN patient_incidents_view pi ON \
pc.patient_id=pi.patient_id)")
transition_sum <- patient_condition %>%
group_by(initial_status, final_status) %>%
summarise(
total_N = n(), # Number of repetitions for each transition
first_quartile = quantile(duration, 0.25), # First quantile of duration
flow_time = quantile(duration, 0.5), # Median
third_quartile = quantile(duration, 0.75) # Third quantile of duration
) %>%
group_by(initial_status) %>%
mutate(transition_percentage = total_N / sum(total_N) * 100) %>%
arrange(initial_status, desc(total_N)) %>%
select(all_of(c("initial_status","final_status","total_N","transition_percentage",
"first_quartile","flow_time","third_quartile")))
transition_sum$initial_status = factor(transition_sum$initial_status,
levels = c('cvd','hf','ckd','f','ob','else'))
transition_sum$final_status = factor(transition_sum$final_status,
levels = c('cvd','hf','ckd','f','ob','else','death'))
process_matrix <- ggplot(
transition_sum, aes(x = final_status, y = initial_status)) +
geom_tile(aes(fill = flow_time), color = "white") +
geom_text(aes(label = total_N), vjust = 1) +
scale_fill_distiller() +
labs(title = "Predominant clinical conditions' changes",
x = "Final Status",
y = "Initial Status")+
theme_minimal() + # Apply a minimal theme for a cleaner look
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 14, face = "bold")) # Adjust text sizes
ggsave(gsub("src/analysis-scripts/","",here("outputs/predominant_condition_matrix.png")),plot=process_matrix)Code
patient_pre <- dbGetQuery(con,
"SELECT *,
CASE WHEN sex = 0 THEN 'Male'
ELSE 'Female' END AS sexo,
FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25)
AS 'age',
FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25)
AS 'age_dx'FROM main.patient
WHERE dx_date < '2017-01-01'")
patient_pre <- patient_pre %>%
left_join( country2cont %>% select(country, CC), by = "country") %>%
mutate(education=factor(education, levels=c(0,1,2,3),
labels=c("Without studies","Primary school",
"High school","University")),
copayment=factor(copayment,levels=c(0,1),
labels=c("less than 18000","more or equal than 18000")))
patient_pre$CC[patient_pre$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")
patient_pre%>%summary_factorlist(dependent,
explanatory,
total_col = TRUE,
cont="mean",
cont_cut=7,
na_include = TRUE,
na_to_prop = FALSE,
include_col_totals_percent =FALSE,
add_col_totals = TRUE)-> patient_pre_tab
kable(patient_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Female | Male | Total |
|---|---|---|---|---|
| Total N | 7568 | 7188 | 14756 | |
| age | Mean (SD) | 78.6 (13.3) | 78.1 (13.4) | 78.4 (13.3) |
| age_dx | Mean (SD) | 75.6 (13.4) | 75.1 (13.5) | 75.3 (13.4) |
| CC | AS | 3784 (50.0) | 3594 (50.0) | 7378 (50.0) |
| EU | 1892 (25.0) | 1797 (25.0) | 3689 (25.0) | |
| OC | 1892 (25.0) | 1797 (25.0) | 3689 (25.0) | |
| copayment | less than 18000 | 2476 (50.0) | 2476 (50.4) | 4952 (50.2) |
| more or equal than 18000 | 2480 (50.0) | 2436 (49.6) | 4916 (49.8) | |
| (Missing) | 2612 | 2276 | 4888 | |
| education | Without studies | 0 (NaN) | 0 (NaN) | 0 (NaN) |
| Primary school | 0 (NaN) | 0 (NaN) | 0 (NaN) | |
| High school | 0 (NaN) | 0 (NaN) | 0 (NaN) | |
| University | 0 (NaN) | 0 (NaN) | 0 (NaN) | |
| (Missing) | 7568 | 7188 | 14756 | |
| avg_income | (Missing) | 7568 | 7188 | 14756 |
Code
param_pre <- dbGetQuery(con,"SELECT * FROM param_prevalents_pre_last_view")
param_list <- sort(unique(param_pre$param_name))
param_cat_pre <- dbGetQuery(con,
"SELECT *, \
CASE WHEN param_cat_name='physical_activity' AND param_cat_value=0 \
THEN 'Incapacity' \
WHEN param_cat_name='physical_activity' AND param_cat_value=1 \
THEN 'Inactive' \
WHEN param_cat_name='physical_activity' AND param_cat_value=2 \
THEN 'Partially Active' \
WHEN param_cat_name='physical_activity' AND param_cat_value=3 \
THEN 'Active' \
WHEN param_cat_name='smoking_status' AND param_cat_value=0 \
THEN 'Non Smoker' \
WHEN param_cat_name='smoking_status' AND param_cat_value=1 \
THEN 'Ex-smoker < 1 year' \
WHEN param_cat_name='smoking_status' AND param_cat_value=2 \
THEN 'Ex-smoker >= 1 year' \
WHEN param_cat_name='smoking_status' AND param_cat_value=3 \
THEN 'Smoker' \
WHEN param_cat_name='alcohol' AND param_cat_value=0 \
THEN 'Abstinent' \
WHEN param_cat_name='alcohol' AND param_cat_value=1 \
THEN 'Moderate Drinker' \
WHEN param_cat_name='alcohol' AND param_cat_value=2 \
THEN 'Heavy Drinker' \
WHEN param_cat_name='vaccination_flu' AND param_cat_value=0 \
THEN 'No' \
WHEN param_cat_name='vaccination_flu' AND param_cat_value=1 \
THEN 'Yes' \
WHEN param_cat_name='working_status' AND param_cat_value=0 \
THEN 'Active' \
WHEN param_cat_name='working_status' AND param_cat_value=1 \
THEN 'Unemployed' \
WHEN param_cat_name='working_status' AND param_cat_value=2 \
THEN 'Pensionist' \
ELSE '' END AS param_cat_value_str
FROM param_cat_prevalents_pre_last_view WHERE param_cat_name!='vaccination_covid'")
param_cat_list <- sort(unique(param_cat_pre$param_cat_name))
dependent="sexo"
param_pre2unnest <- param_pre %>%
pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id")
param_cat_pre2unnest <- param_cat_pre %>%
pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value_str)%>%
left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id")
param_pre_tab <- rbind(
param_pre2unnest %>%
summary_factorlist(dependent ,
param_list,
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,add_row_totals = TRUE,
include_row_missing_col = TRUE,
include_col_totals_percent =FALSE,add_col_totals = TRUE),
param_cat_pre2unnest %>%
summary_factorlist(dependent ,
param_cat_list,
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,cont_cut=8,add_row_totals = TRUE,
include_row_missing_col = TRUE))
kable(param_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | Total N | Missing N | levels | Female | Male | Total |
|---|---|---|---|---|---|---|
| Total N | 7368 | 7012 | 14380 | |||
| alb | 3472 (24.1) | 10908 | Mean (SD) | 4.0 (0.5) | 4.1 (0.6) | 4.1 (0.5) |
| bmi | 3528 (24.5) | 10852 | Mean (SD) | 35.6 (9.5) | 34.7 (9.5) | 35.2 (9.5) |
| col | 3548 (24.7) | 10832 | Mean (SD) | 203.9 (60.4) | 203.0 (62.8) | 203.4 (61.6) |
| creat | 3680 (25.6) | 10700 | Mean (SD) | 1.7 (0.8) | 1.7 (0.8) | 1.7 (0.8) |
| dbp | 3648 (25.4) | 10732 | Mean (SD) | 79.2 (16.0) | 77.2 (15.8) | 78.2 (15.9) |
| filtglom | 3660 (25.5) | 10720 | Mean (SD) | 67.0 (31.2) | 68.7 (29.9) | 67.8 (30.6) |
| hba1c | 3352 (23.3) | 11028 | Mean (SD) | 8.5 (2.1) | 8.6 (2.0) | 8.5 (2.1) |
| hdl | 3468 (24.1) | 10912 | Mean (SD) | 59.2 (21.8) | 58.1 (22.2) | 58.6 (22.0) |
| height | 3628 (25.2) | 10752 | Mean (SD) | 1.7 (0.2) | 1.7 (0.2) | 1.7 (0.2) |
| indalbcr | 3664 (25.5) | 10716 | Mean (SD) | 567.0 (341.6) | 578.8 (341.4) | 572.9 (341.5) |
| ldl | 3640 (25.3) | 10740 | Mean (SD) | 124.1 (51.1) | 122.9 (50.1) | 123.6 (50.6) |
| sbp | 3640 (25.3) | 10740 | Mean (SD) | 144.2 (27.5) | 140.7 (27.0) | 142.6 (27.3) |
| trgl | 3544 (24.6) | 10836 | Mean (SD) | 244.0 (120.0) | 242.4 (111.9) | 243.2 (116.0) |
| weight | 3476 (24.2) | 10904 | Mean (SD) | 118.6 (48.1) | 111.8 (47.1) | 115.3 (47.7) |
| alcohol | 1968 (22.8) | 6648 | Abstinent | 380 (35.7) | 312 (34.5) | 692 (35.2) |
| Heavy Drinker | 356 (33.5) | 288 (31.9) | 644 (32.7) | |||
| Moderate Drinker | 328 (30.8) | 304 (33.6) | 632 (32.1) | |||
| family_history_CVD | 2112 (24.5) | 6504 | 1104 (100.0) | 1008 (100.0) | 2112 (100.0) | |
| physical_activity | 1936 (22.5) | 6680 | Active | 248 (24.9) | 260 (27.7) | 508 (26.2) |
| Inactive | 216 (21.7) | 196 (20.9) | 412 (21.3) | |||
| Incapacity | 268 (26.9) | 240 (25.5) | 508 (26.2) | |||
| Partially Active | 264 (26.5) | 244 (26.0) | 508 (26.2) | |||
| smoking_status | 1904 (22.1) | 6712 | Ex-smoker < 1 year | 296 (28.9) | 208 (23.6) | 504 (26.5) |
| Ex-smoker >= 1 year | 284 (27.7) | 232 (26.4) | 516 (27.1) | |||
| Non Smoker | 216 (21.1) | 176 (20.0) | 392 (20.6) | |||
| Smoker | 228 (22.3) | 264 (30.0) | 492 (25.8) | |||
| vaccination_flu | 2004 (23.3) | 6612 | Yes | 1120 (100.0) | 884 (100.0) | 2004 (100.0) |
| working_status | 1924 (22.3) | 6692 | Active | 348 (34.5) | 264 (28.8) | 612 (31.8) |
| Pensionist | 336 (33.3) | 336 (36.7) | 672 (34.9) | |||
| Unemployed | 324 (32.1) | 316 (34.5) | 640 (33.3) |
Code
patient_pre$deregistration_date <- patient_pre$deregistration_date %>%
replace_na(as.Date('2023-01-01'))
patient_pre <- patient_pre %>%
mutate(follow_up_time=as.numeric(difftime(deregistration_date, as.Date('2016-12-31'),units="days")/365.25))%>%
filter(follow_up_time!=0)
ss_use_pre <- dbGetQuery(con," SELECT * FROM main.ss_use \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date <'2017-01-01') AND
visit_date>='2017-01-01'") %>% mutate(
visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
labels=c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"))
)
ss_use_pre1 <- ss_use_pre %>%
filter(visit_loc == 1)
use_pre_freq1 <- as.data.frame.matrix(table(ss_use_pre1$patient_id,ss_use_pre1$visit_type))
use_pre_freq1 <- cbind(patient_id = rownames(use_pre_freq1), use_pre_freq1)
use_pre_freq1_time <- left_join(use_pre_freq1, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_pre2 <- ss_use_pre %>%
filter(visit_loc == 2)
use_pre_freq2 <- as.data.frame.matrix(table(ss_use_pre2$patient_id,ss_use_pre2$visit_type))
use_pre_freq2 <- cbind(patient_id = rownames(use_pre_freq2), use_pre_freq2)
use_pre_freq2_time <- left_join(use_pre_freq2, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_pre3 <- ss_use_pre %>%
filter(visit_loc == 3)
use_pre_freq3 <- as.data.frame.matrix(table(ss_use_pre3$patient_id,ss_use_pre3$visit_type))
use_pre_freq3 <- cbind(patient_id = rownames(use_pre_freq3), use_pre_freq3)
use_pre_freq3_time <- left_join(use_pre_freq3, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
use_pre_tab <- rbind(
rbind(data.frame(label = "at care center", levels = '',
Male = '', Female = '', Total = ''),
use_pre_freq1_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "at home", levels = '',
Male = '', Female = '', Total = ''),
use_pre_freq2_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "by telephone or similar", levels = '',
Male = '', Female = '', Total = ''),
use_pre_freq3_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)))
kable(use_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Male | Female | Total |
|---|---|---|---|---|
| at care center | ||||
| PC_physician | Mean (SD) | 0.1 (2.3) | -0.0 (2.8) | 0.0 (2.6) |
| PC_nurse | Mean (SD) | 0.2 (6.8) | -0.1 (5.6) | 0.1 (6.2) |
| PC_social_worker | Mean (SD) | -0.3 (9.9) | 0.1 (5.2) | -0.1 (7.8) |
| PC_emergency_service | Mean (SD) | -0.1 (3.3) | 0.0 (2.2) | -0.1 (2.8) |
| PC_others | Mean (SD) | -0.1 (3.8) | 0.2 (5.7) | 0.0 (4.9) |
| Specialized_visit_physician | Mean (SD) | -0.1 (4.8) | 0.2 (12.2) | 0.0 (9.3) |
| Specialized_visit_nurse | Mean (SD) | -0.0 (13.3) | 0.1 (5.4) | 0.0 (10.0) |
| Specialized_visit_unknown_professional | Mean (SD) | -0.1 (2.8) | 0.3 (9.4) | 0.1 (7.0) |
| at home | ||||
| PC_physician | Mean (SD) | -0.2 (10.2) | 0.1 (4.2) | -0.0 (7.7) |
| PC_nurse | Mean (SD) | 0.4 (11.3) | 0.0 (6.6) | 0.2 (9.2) |
| PC_social_worker | Mean (SD) | -0.1 (4.5) | 0.1 (5.8) | 0.0 (5.2) |
| PC_emergency_service | Mean (SD) | -0.1 (3.3) | 0.1 (4.1) | 0.0 (3.7) |
| PC_others | Mean (SD) | 0.1 (4.5) | 0.1 (1.8) | 0.1 (3.4) |
| Specialized_visit_physician | Mean (SD) | -0.1 (4.0) | 0.3 (10.2) | 0.1 (7.8) |
| Specialized_visit_nurse | Mean (SD) | 0.3 (10.0) | 0.0 (7.6) | 0.2 (8.8) |
| Specialized_visit_unknown_professional | Mean (SD) | -0.2 (3.6) | 0.4 (11.1) | 0.1 (8.3) |
| by telephone or similar | ||||
| PC_physician | Mean (SD) | -0.0 (3.2) | 0.0 (4.8) | -0.0 (4.1) |
| PC_nurse | Mean (SD) | -0.2 (13.9) | -0.1 (5.2) | -0.1 (10.4) |
| PC_social_worker | Mean (SD) | 0.1 (10.0) | 0.3 (10.2) | 0.2 (10.1) |
| PC_emergency_service | Mean (SD) | -0.0 (4.8) | -0.2 (9.9) | -0.1 (7.8) |
| PC_others | Mean (SD) | -0.1 (3.9) | -0.0 (5.5) | -0.0 (4.7) |
| Specialized_visit_physician | Mean (SD) | -0.4 (11.5) | 0.1 (3.5) | -0.1 (8.4) |
| Specialized_visit_nurse | Mean (SD) | 0.1 (10.0) | 0.2 (6.2) | 0.2 (8.3) |
| Specialized_visit_unknown_professional | Mean (SD) | -0.1 (4.4) | 0.0 (6.0) | -0.0 (5.3) |
Patients without predominant clinical condition
Event log’s creation and description
Choosing patients without any predominant clinical condition and after some preprocessing we obtain an event log that can be represented in the below process map Figure 13. There is shown how the events are connected, the mean number of days patients spent in each event (or treatment), percentage of patients who made each transition and the mean number of days it took to make the transition. However, a spaghetti map is obtained and nothing can be concluded. Therefore, we have to simplify the process map and for example only show the most frequent traces covering 20% of the event log as in Figure 14. Moreover, in Figure 15 we show percentage of patients’ traces each activity is present.
Clustering traces
Once the set of traces to analyze are selected, there is a need to choose a distance measure to clustering. In this example Levenshtein similarity is chosen to calculate the distance matrix.
When distance matrix is acquired, we are able to cluster. However, the number of clusters have to be fixed before. With these figures we are able to conclude what could be the optimal number of clusters.
Choosing the optimal number of clusters, too small clusters can appear, but we can exclude those of less than 30 traces to avoid having clusters of low representation. The process map and the most frequent traces of one of these clusters that remain are the followings.
Conformace checking
Once traces are clusterized, with a boxplot is easy to show that each cluster’s behavior with respect to the treatment guides is different. Comparing traces with Figure 6, calculating the fitness (1 being perfect match with the petri net and 0 being the lowest possible fitness) of each trace and grouping by cluster results in_ Figure 19.
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 702, number of events= 63
coef exp(coef) se(coef) z Pr(>|z|)
age -0.02058 0.97963 0.02745 -0.750 0.454
sex1 -0.40668 0.66586 0.26622 -1.528 0.127
fitness -0.28259 0.75383 0.91797 -0.308 0.758
exp(coef) exp(-coef) lower .95 upper .95
age 0.9796 1.021 0.9283 1.034
sex1 0.6659 1.502 0.3952 1.122
fitness 0.7538 1.327 0.1247 4.557
Concordance= 0.55 (se = 0.044 )
Likelihood ratio test= 2.6 on 3 df, p=0.5
Wald test = 2.58 on 3 df, p=0.5
Score (logrank) test = 2.6 on 3 df, p=0.5
Using same predictive variables a joint latent class model of two classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 160
Number of observations: 702
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 63
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 1
Convergence criteria: parameters= 3e-16
: likelihood= 6.2e-11
: second derivatives= 3.2e-11
Goodness-of-fit statistics:
maximum log-likelihood: 53.16
AIC: -78.31
BIC: -35.26
Score test statistic for CI assumption: 2.505 (p-value=0.2858)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.28349 0.35838 -0.791 0.42893
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.05883 0.04304 1.367 0.17169
event1 +/-sqrt(Weibull2) 1.15803 0.06954 16.653 0.00000
event1 SurvPH class1 1.90432 0.44799 4.251 0.00002
age -0.03056 0.03008 -1.016 0.30974
sex1 -0.34676 0.29674 -1.169 0.24258
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.85420 0.02527 33.809 0.00000
intercept class2 0.81311 0.01876 43.341 0.00000
t_0 class1 -0.00116 0.00010 -11.153 0.00000
t_0 class2 -0.00062 0.00004 -17.622 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.01693
t_0 -0.00002 0
coef Se
Residual standard error 0.08451 0.00276
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 57.00 103.00
% 35.62 64.38
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.7979 0.2021
class2 0.2258 0.7742
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 77.19 52.43
prob>0.8 54.39 45.63
prob>0.9 22.81 39.81
Posterior classification based only on longitudinal data:
class1 class2
N 54.00 106.00
% 33.75 66.25
png
2
CV Disease
Event log’s creation and description
Choosing patients with a cardiovascular disease and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces
Conformace checking
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 276, number of events= 41
coef exp(coef) se(coef) z Pr(>|z|)
age -0.005415 0.994600 0.012044 -0.450 0.653
sex1 -0.446039 0.640159 0.321092 -1.389 0.165
fitness -0.699740 0.496714 1.491856 -0.469 0.639
exp(coef) exp(-coef) lower .95 upper .95
age 0.9946 1.005 0.97140 1.018
sex1 0.6402 1.562 0.34117 1.201
fitness 0.4967 2.013 0.02668 9.246
Concordance= 0.589 (se = 0.05 )
Likelihood ratio test= 2.1 on 3 df, p=0.6
Wald test = 2.08 on 3 df, p=0.6
Score (logrank) test = 2.11 on 3 df, p=0.6
Using same predictive variables a joint latent class model of two classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 72
Number of observations: 276
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 41
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 2
Convergence criteria: parameters= 1.2e-09
: likelihood= 4.5e-06
: second derivatives= 1.7e-06
Goodness-of-fit statistics:
maximum log-likelihood: 34.24
AIC: -40.47
BIC: -8.6
Score test statistic for CI assumption: 0.096 (p-value=0.9534)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.40879 0.48094 -0.850 0.39533
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.08098 0.04862 1.666 0.09578
event1 +/-sqrt(Weibull2) 1.03113 0.08180 12.605 0.00000
event1 SurvPH class1 2.01516 0.60864 3.311 0.00093
age -0.02223 0.01732 -1.283 0.19936
sex1 -0.38288 0.39403 -0.972 0.33120
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.56277 0.02326 24.190 0.00000
intercept class2 0.53931 0.01701 31.704 0.00000
t_0 class1 -0.00091 0.00023 -3.951 0.00008
t_0 class2 -0.00033 0.00004 -8.405 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.00807
t_0 0.00000 0
coef Se
Residual standard error 0.05009 0.00292
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 29.00 43.00
% 40.28 59.72
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.7974 0.2026
class2 0.1306 0.8694
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 86.21 79.07
prob>0.8 58.62 65.12
prob>0.9 17.24 58.14
Posterior classification based only on longitudinal data:
class1 class2
N 19.00 53.00
% 26.39 73.61
png
2
Heart Failure
Event log’s creation and description
Choosing patients with heart failure and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces

Conformace checking
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 7, number of events= 1
coef exp(coef) se(coef) z Pr(>|z|)
age 3.995e-01 1.491e+00 1.551e+06 0 1
sex1 -3.196e+00 4.093e-02 1.241e+07 0 1
fitness 0.000e+00 1.000e+00 6.978e+08 0 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.49106 0.6707 0 Inf
sex1 0.04093 24.4318 0 Inf
fitness 1.00000 1.0000 0 Inf
Concordance= 1 (se = 0 )
Likelihood ratio test= 2.2 on 3 df, p=0.5
Wald test = 0 on 3 df, p=1
Score (logrank) test = 2 on 3 df, p=0.6
Using same predictive variables a joint latent class model of two classes is made:



Chronic kidney disease
Event log’s creation and description
Choosing patients with chronic kidney disease and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces
Conformace checking
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 97, number of events= 15
coef exp(coef) se(coef) z Pr(>|z|)
age 0.009163 1.009206 0.019369 0.473 0.6361
sex1 -0.091847 0.912245 0.557819 -0.165 0.8692
fitness -6.292425 0.001850 3.399764 -1.851 0.0642 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.00921 0.9909 9.716e-01 1.048
sex1 0.91224 1.0962 3.057e-01 2.722
fitness 0.00185 540.4623 2.362e-06 1.449
Concordance= 0.702 (se = 0.073 )
Likelihood ratio test= 3.85 on 3 df, p=0.3
Wald test = 3.67 on 3 df, p=0.3
Score (logrank) test = 3.83 on 3 df, p=0.3
Using same predictive variables a joint latent class model of two classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 26
Number of observations: 97
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 15
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 1
Convergence criteria: parameters= 1.4e-11
: likelihood= 1.8e-06
: second derivatives= 8.7e-06
Goodness-of-fit statistics:
maximum log-likelihood: 7.6
AIC: 12.8
BIC: 30.42
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.00093 0.87912 -0.001 0.99916
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.02738 0.01728 1.585 0.11307
event1 +/-sqrt(Weibull2) 1.21071 0.17522 6.910 0.00000
event1 SurvPH class1 2.36080 1.05969 2.228 0.02589
age 0.00952 0.02416 0.394 0.69370
sex1 -0.76573 0.68034 -1.126 0.26038
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.53595 0.05122 10.465 0.00000
intercept class2 0.61907 0.02639 23.463 0.00000
t_0 class1 -0.00080 0.00010 -7.832 0.00000
t_0 class2 -0.00046 0.00004 -11.727 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.00439
t_0 0.00000 0
coef Se
Residual standard error 0.05782 0.00491
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 12.00 14.00
% 46.15 53.85
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.9251 0.0749
class2 0.1355 0.8645
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 91.67 85.71
prob>0.8 91.67 64.29
prob>0.9 75.00 50.00
Posterior classification based only on longitudinal data:
class1 class2
N 13 13
% 50 50
png
2
Frailty
Event log’s creation and description
Choosing patients with frailty and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.